I've been blogging for a while now. Looking back, I had a personal blog about things I was interested in for some years during my studies. I ran a comedic fake news page, too. My wife and I write a blog about our baby, and I also have an IBM-internal blog about SAN troubleshooting. Last year I started seb's sanblog on developerworks, and it was quite a slow start. At the beginning of 2011 there was a lot to do for my primary job on the one hand, and on the other hand my daughter was born and my interests shifted a bit. As I write the articles for this blog mainly in my spare time, the simple equation was: no spare time = no blog posts.
In mid-2011 the situation improved a bit. My baby Johanna was somehow out of the woods (is "to be out of the woods" really the English term for finishing the most stressful phase?) after her hip dysplasia was cured, and I was able to really start blogging. And then I asked myself: What do you want to blog about? There is so much going on in the storage industry, but am I really the best person to blog about it? Can I really add value with blog articles here? I don't think so. Of course I comment on such topics on other people's blogs, Twitter or social platforms like LinkedIn from time to time. After all, there's always some FUD around that I can't resist commenting on. But I try to keep my own blog strictly about SAN and storage virtualization, with a focus on troubleshooting.
I wrote 19 articles in 2011. That's not much compared to, let's say, storagebod. Why is that? Well, for me it's quite a balancing act to decide what I can blog about. Of course I can't blog about a specific customer having a problem. That's a no-go. There are also things I don't want to blog about because there is already plenty out there about them. And then there is stuff that I simply can't blog about, because it's internal information - special troubleshooting procedures I created, for example, or information about internal tools and projects I'm involved in.
What remains then?
Oh, there's still enough to blog about. If I notice situations like "Hey, I've now explained this general thing in four cases to customers who were completely unaware of it", or if I see a feature that could really help admins but hardly anyone uses it so far, then I write a blog article. I see it more as additional explanation and food for thought. My target audience consists of customers on the "doing level" (admins, architects) as well as people troubleshooting SANs. I know that's a significantly smaller group than the audience of the more general storage bloggers, but I'm happy if the right people read it and I get the feedback that my blog helped them with their problems. I started counting the visitors internally at the end of July, and so far around 32,000 have visited seb's sanblog. That's not too bad, I think.
Writing such a retrospective, I want to thank the people who inspired me to start a blog. First of all there are Barry Whyte and Tony Pearson with their developerworks blogs, showing me that there are actually IBMers out there writing about my topics of interest. Reading their blogs led me to many others - also from other companies - that I try to look into daily. Most of them you see in the list on the right bar of this blog. But a special Thank you! goes out to my Australian colleague Anthony Vandewerdt, whose blog has a big focus on the people really working with IBM storage products and therefore SAN products as well. His Aussie Storage Blog on developerworks triggered my decision to start my own external blog. Thank you again!
So what to expect from 2012?
To be honest, I have no idea :o) There is no overall plan. No weeks-long article pipeline. I'm not invited to blogger events or anything like that, and my blog is in no way a marketing channel for upcoming IBM products. Everything I write is simply born out of my experience with SAN products and troubleshooting. I try not to write too much about hypes and trends, unless they have a direct impact on the SAN - like oversaturated hypervisors turning into slow drain devices, or Big Data as an excuse to do some really weird things with your storage architecture :o)
Are you still interested?
Then be my guests in 2012, and if you feel the urge to say something about, against or in addition to an article, don't hesitate to leave a comment! Have a nice start into the New Year!
Performance problems are still the most malicious issues on my list. They come in many flavors, and most of them have two things in common: 1) They are hardly ever SAN defects, and 2) They need to be solved as quickly as possible, because they really have an impact.
If a switch just crashes or an ISL drops dead or even an ugly firmware bug blocks the communication of an entire fabric, it might ring all alarm bells. But that's something you (hopefully) have your redundancy for. Performance problems, on the other hand, can have a high impact on your applications across the whole data center without a single concerning message in the logs, if your systems are not well prepared for it. Besides the preparation steps I pointed out here, there is a tool in Brocade's FabricOS especially for performance problems: the bottleneck monitor, or short: the bottleneckmon.
If a performance problem is escalated to the technical support, the next thing that most probably happens is that the support guy asks you to clear the counters, wait up to three hours while the problem is noticeable, and then gather a supportsave of each switch in both fabrics.
Why 3 hours?
A manual performance analysis is based on certain 32-bit counters in a supportsave. In a device that's able to route I/O of several gigabits per second, 32 bits aren't a huge range for counters, and they will eventually wrap if you wait too long. But a wrapped counter is worthless, because you can't tell if and how often it wrapped. So all comparisons would be meaningless.
Besides the wait time, the whole handling of the data collections, including gathering and uploading them to the support, takes precious time. And then the support has to process and analyze them. After all these hours of continuously repeating telephone calls you get from management and internal and/or external customers, the support guy has hopefully found the cause of your performance problem. And keeping point 1) from my first paragraph in mind, it's most probably not even the fault of a switch*). If he makes you aware of a slow drain device, you would now start to involve the admins and/or support for the particular device.
You definitely need a shortcut!
And this shortcut is the bottleneckmon. It's made to permanently check your SAN for performance problems. Configured correctly, it will pinpoint the cause of performance problems - at least the bigger ones. The bottleneckmon was introduced with FabricOS v6.3x, with some major limitations. But with v6.4x it eventually became a must-have by offering two useful features:
Congestion bottleneck detection
This just measures the link utilization. With the Fabric Watch license (pre-loaded on many of the IBM-branded switches and directors) you have been able to do that for a long time already. But the bottleneckmon offers a bit more convenience and puts it in the proper context. The more important thing is:
Latency bottleneck detection
This feature shows you most of the medium to major situations of buffer credit starvation. If a port runs out of buffer credits, it's not allowed to send frames over the fibre. To make a long story short: if you see a latency bottleneck reported against an F-Port, you have most probably found a slow drain device in your SAN. If it's reported against an ISL, there are two possible reasons:
- There could be a slow drain device "down the road" - the slow drain device could be connected to the adjacent switch or to another one connected to it. Credit starvation typically back-pressures to affect wide areas of the fabric.
- The ISL could have too few buffers. Maybe the link is just too long. Or the average frame size is much smaller than expected. Or QoS is configured on the link but you don't have QoS zones prioritizing your I/O - this could have a huge negative impact! Another reason could be a misconfigured long-distance ISL.
Whatever it is, it is either the reason for your performance problem or at least contributing to it, and it should definitely be solved. Maybe this article can help you with that.
With FabricOS v7.0 the bottleneckmon was improved again. While the core policy which detects credit starvation situations was pretty much pre-defined before v7.0, you're now able to configure it in the minutest detail. We are still testing that out in more detail - for the moment I recommend using the defaults.
So how to use it?
First of all: I highly recommend updating your switches to the latest supported v6.4x code if possible. The bottleneckmon is much better there than in v6.3! If you look up bottleneckmon in the command reference, it offers plenty of parameters and sub-commands. But in fact, for most environments and performance problems it's enough to just enable it and activate the alerting:
myswitch:admin> bottleneckmon --enable -alert
That's it. It will generate messages in your switch's error log if a congestion or a latency bottleneck is found. Pretty straightforward. If you are not sure, you can check the status with:
myswitch:admin> bottleneckmon --status
And of course there is a show command which can be used with various filter options, but the easiest way is to just wait for the messages in the error log. They will tell you the type of bottleneck and of course the affected port.
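If you want to poll a suspicious port manually instead of waiting for the log, the show command can do that. A sketch (slot/port number, interval and time span are examples only - please check the available options in the command reference for your code level):
myswitch:admin> bottleneckmon --show -interval 10 -span 300 2/4
This prints per-interval bottleneck statistics for port 4 on slot 2 over the last 300 seconds.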
And what if there are messages now?
Well, there is still the chance that there are actually situations of buffer credit starvation the default-configured bottleneckmon can't see. But as you are reading an introduction here, I assume you'd just open a case with the IBM support.
You'll Never Walk Alone! :o)
*) Depending on country-specific policies and maintenance contracts, a performance analysis as described above could be a charged service in your region.
There are some goodies in FOS 7.0 that are not announced big-time. Goodies especially for us troubleshooters. There are regular, but not too frequent, so-called RAS meetings. Here we have the possibility to wish for new RAS features - wishes born out of real problem cases. Some of the wishes we had were implemented in FOS 7.0 (besides the Frame Log I already described in a previous post).
Time-out discards in porterrshow
You probably noticed that I have a hobbyhorse when it comes to troubleshooting in the SAN: performance problems. Medium to major SAN performance problems usually go along with frame drops in the fabric. If a frame is kept in a port's buffer for 500ms because it can't be delivered in time, it will be dropped. So these drops are a good indicator for a performance problem. There is a counter in portstatsshow for each port (depending on code version and platform) named er_tx_c3_timeout, which shows how often the ASIC connected to a specific port had to drop a frame that was intended to be sent to this port. It means: This guy was busy X times and I had to drop a frame for him.
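Checking that counter for a single port could look like this (a hypothetical excerpt - the exact counter names and descriptions can vary slightly between platforms and code levels):
myswitch:admin> portstatsshow 7
...
er_tx_c3_timeout   14    Time out TX frames for this port
...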
But who looks into portstatsshow anyway? At least for monitoring? In that area the porterrshow command is way more popular, because it provides a single table for all FC ports showing the most important error counters. Unfortunately it had only one cumulative counter for all reasons of frame discards - and there are a lot more besides those time-outs. But now there are two additional counters in this table: c3-timeout tx and c3-timeout rx. Of them, the tx counter is the important one, as described above. The rx counter just gives you an idea where the dropped frames came from.
So: just focus on the TX! If it counts up, get some ideas how to treat it here.
The firmware history
Just last week I had a fiddly case about firmware update problems again. There are restrictions on which version you can update to based on the current one. If you don't observe the rules, things could get messed up. And they could get messed up in a way you don't see straightaway. But then suddenly, after some months and maybe another firmware update, the switch runs into a critical situation. Or it has problems with exactly that new firmware update. Some of these problems can render a CP card useless, which is ugly because from a plain hardware point of view nothing is broken. But the card has to be replaced in the end. Sigh.
To make a long story short: wouldn't it be better to actually know the versions the switch was running in the past? And that's the duty of the firmware history:
switch:admin> firmwareshow --history
Firmware version history
Sno  Date & Time               Switch Name  Slot  PID   FOS Version
1    Fri Feb 18 12:58:06 2011  CDCX16       7     1556  Fabos Version v7.0.0d
2    Wed Feb 16 07:27:38 2011  CDCX16       7     1560  Fabos Version v7.0.0a
(example borrowed from the CLI guide)
No access - No problem
There is a mistake almost everybody in the world of Brocade SAN administration makes (hopefully only) once: trying to merge a new switch into an existing fabric and failing with a segmented ISL and a "zone conflict". The most probable reason is that the new switch's default zoning (defzone) is set to "no access".
This feature was introduced a while ago to make Brocade switches a little safer. Earlier, each port was able to see every other port as long as there was no effective zoning on the switch. With "no access" enabled, all traffic between each unzoned pair of devices is blocked if there is no zone including them both. The drawback of "no access" is its technical implementation, though: as soon as it was enabled, a hidden zone was created, and its pure existence blocked the traffic for all unzoned devices. And so, without any indication, the switch ended up with a zone.
But entre nous: no sane person accepts this without raising a few eyebrows. With FOS 7.0 this (mis-)behavior is gone. The new switch has a "no access" setting and wants to merge with the fabric? Fine. You don't have to care, the firmware cares for you!
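On pre-7.0 code you can check and, if needed, change the setting yourself before merging. A sketch (the change has to be committed with cfgsave - and as always, think twice before opening up default zoning in a production fabric):
newswitch:admin> defzone --show
newswitch:admin> defzone --allaccess
newswitch:admin> cfgsave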
Thanks for the little helpers Brocade - and I hope you stay open for new ideas :o)
Many of you (at least many of the few really reading this stuff) may already know what CRC is. But I think it doesn't hurt to have a short recap. CRC means Cyclic Redundancy Check and can be used as an error detection technique. Basically it calculates a kind of hash value that tends to be very different if you change one or more bits in the original data. Besides that, it's quite easy to implement. I once wrote a CRC algorithm in assembler (but for the Intel 8008) during my studies, and it was a nice exercise in optimization.
What has that got to do with SAN?
In Fibre Channel we calculate a CRC value for each frame and store it in the 4 bytes right before the end-of-frame delimiter (EOF). The recipient reads the frame bit by bit and calculates the CRC value itself along the way. Reaching the end of the frame, it knows whether the CRC value stored there matches the content of the frame. If it doesn't, there was at least one bit error, the frame has to be assumed corrupted, and thus it can be dropped. Now, if the recipient is a switch, the next thing to happen depends on which frame forwarding method is used:
The switch reads the whole frame into one of its ingress ("incoming") buffers and checks the CRC value. If the frame is corrupted, the switch drops it. It's up to the destination device to recognize that a frame is missing, and at least the initiator will track the open exchange and start error recovery as soon as time-out values are reached. Many of the Cisco MDS 9000 switches work this way (store-and-forward). It ensures that the network is not stressed with frames that are corrupted anyway, but it comes with a higher latency. From a troubleshooting point of view, the link connected to the port reporting CRC errors is most probably the faulty one.
To decrease this latency, the switch could just read in the destination address, and as soon as that one is confirmed to be zoned with the source connected to the F-port (a really quick look into the so-called CAM table stored within the ASIC), the frame goes directly on its way towards the destination. So if everything works fine - enough buffer credits are available - the frame's header is already on the next link before the switch has even read the CRC value. The frame will travel the whole path to the destination device even though it's corrupted, and all switches it passes will recognize that. Brocade switches work this way (cut-through). As soon as the corrupted frame reaches the destination, it will be dropped.
Regardless of which method is used, CRC remains just error detection, and most probably the whole exchange has to be aborted and repeated anyway.
So how to troubleshoot CRC errors on Brocade switches then?
If you only had a counter for CRC errors, you would be in trouble now. Because if all switches along the path increase their CRC error counter for this frame, how would you know which link is really broken? If you have multiple broken links in a huge SAN, this could turn ugly. But there are two additional counters for you:
- enc in - The frame is additionally encoded in a way that bit errors can be detected. And because the frame is decoded when it's read from the fibre and encoded again before it's sent out to the next fibre, the enc in (encoding errors inside frames) counter will only increase for the port that is connected to the faulty link.
- crc g_eof - Although a corrupted frame will be cut through as explained above, there is one thing the switch can do when it encounters a mismatch between the calculated CRC value and the one stored in the frame: it replaces the EOF with another 4 bytes meaning something like "This is the end of the frame, but the frame was recognized as corrupted." The crc g_eof counter basically means "The CRC value was wrong but nobody noticed it before - therefore it still had a good EOF." So if this counter increases for a particular link, that link is most probably the faulty one.
          frames        enc   crc   crc    too   too   bad   enc   disc  link  loss  loss  frjt  fbsy
        tx     rx       in    err   g_eof  shrt  long  eof   out   c3    fail  sync  sig
  1:    1.5g   1.8g     13    12    12     0     0     0     1.1m  0     2     650   2     0     0
  2:    1.3g   1.4g     0     101   0      0     0     0     0     0     0     0     0     0     0
  3:    1.9g   2.9g     82    15    0      0     3     12    847   0     0     0     0     0     0
Port 1 shows a link with classical bit errors. You see CRC errors and also enc in errors. Along with them you see crc g_eof. Everything as expected. Just go ahead and check / clean / replace the cable and/or SFPs. There are some tests you could do to determine which one is broken, like "porttest" and "spinfab".
Port 2 is a typical example of an ISL with forwarded CRC errors. The ISL itself is error-free. It just transported some previously corrupted frames (crc err but no enc in) which were already "tagged" as corrupted, hence no crc g_eof increase.
Port 3 is a bit tricky now. If you just rely on crc g_eof, it seems to be a victim of forwarded CRC errors, too. But that's not the case. Actually the frames were broken in a manner that the end of the frame was not detected properly, so too long and bad eof are increased instead. Best practice: stick with the enc in counter. It still shows that the link indeed generates errors.
Hold on, Help is on the way!
Now, with 16G FC as the state of the art, things have changed a bit. It uses a new encoding method and comes with a forward error correction (FEC) feature. Brocade provides this with FabricOS v7.0x on 16G links. It is able to correct up to 11 bits in a full FC frame. FEC is not really highlighted or specially standing out in Brocade's courses and release notes, but in my opinion this thing is a game changer! Eleven bit errors within one frame! Based on the ratio between enc in and crc err we have seen so far - which basically shows how many bit errors you have in a frame on average - I expect this to solve over 90% of the physical problems we have in SANs today. Without the end-device-driven error recovery, which takes ages in Fibre Channel terms. Fewer aborts, fewer time-outs, fewer slow drain devices caused by physical problems! If this works as intended, SANs will reach a new level of reliability.
So let's see how this turns out in the future. It might be a bright one! :o)
The Storwize V7000 and the SVC (SAN Volume Controller) share the same code base and therefore the same error codes. Many of them indicate a failure condition in the machine itself, but others just point to an external problem source. The error 1370 is of the second kind. There is not really much information about it in the manuals, but in fact it can give you a good understanding of what's going wrong.
As storage virtualization products, the SVC and the V7000 - if you use it to virtualize external storage - are actually the hosts for the external storage. Speaking SCSI, they are the initiators and the external backend storage arrays are the targets. Usually the initiators monitor their connectivity to the targets and do the error recovery if necessary. And so the SVC and the V7000 focus on monitoring the state of their backend storage and can actually help you troubleshoot it.
So you have 1370 errors, what now?
They come in two flavors: the event id 010018 (against an mdisk) and the event id 010030 (against a controller - aka storage array). I'll explain the 010030, as it's easier to understand, but understanding it will give you the insight to understand the 010018, too.
If you double-click the 1370 in your event log, you see the details of the error:
You see the reporting node and the controller the error is reported against. But the most important thing is the KCQ: the Sense Key - Code - Qualifier.
Imagine this situation: the SVC is the initiator. It sends an I/O towards the storage device - the target. But the target faces a "note-worthy" condition at that very moment. So it will make the initiator aware of it by sending a so-called "check condition". Curious as it is, the initiator wants to know the details and requests the sense data. This sense data will now be stored in - you already guessed it - a 1370, in the format Key - Code - Qualifier. Often the last two are referred to as ASC (Additional Sense Code; the green one) and ASCQ (Additional Sense Code Qualifier; the blue one).
Where's the Rosetta Stone?
This sense data can be translated using the official SCSI reference table by Technical Committee T10 (the committee behind the SCSI protocol). If you encounter an ASC/ASCQ combination in a 1370 that can't be found in that list, it's most probably a vendor-specific one. In that case the manufacturer of the target device can give you more information about it.
Back to our example. You see the ASC 29 (the "Code") and the ASCQ 00 (the "Qualifier") here. Looking that up in the list reveals: it's a "POWER ON, RESET, OR BUS DEVICE RESET OCCURRED". This so-called "POR" makes the initiator aware that the target was recently either powered on or did a reset. Usually the initiator gets this with the first I/O it sends to the target after such an event, so that it knows that any open I/O it had against this target is voided and has to be repeated.
Ah, okay. That's it?
No! You see the orange box? This is the time since this sense data was received. The unit is 10ms, so this number actually represents a long time - the POR for this controller happened long ago.
So why do we have a 1370 today?
The 1370 is more of a container for sense data. The number behind the attributes shows the "slot". So the information visible here is for the first slot, and since a long time has passed since it occurred, it's meaningless for us now. Let's scroll down a bit:
In the second slot you see what's really going wrong within the external storage device at the moment, because the time value is 0. That means the 1370 was triggered because of it. And it contains a different set of sense data: ASC 0C / ASCQ 00! If you try to look it up in the list, you will find 0C/00, but hey - this cannot be! The combination 0C/00 means "WRITE ERROR", but it's not defined for "Direct Access Block Devices" like storage arrays.
A Dead End?
No, of course not. In this example the storage is a DS4000. Just download the DS4000 Problem Determination Guide and it will provide an ASC/ASCQ table. Here you'll see that 0C 00, together with the Sense Key 06 (the red circle), means "Caching Disabled - Data caching has been disabled due to loss of mirroring capability or low battery capacity."
Running without the cache in the backend storage can lead to severe performance degradation and should definitely be addressed! Without even looking into the backend storage, you already know what's going wrong there. No need to involve SVC or V7000 support this time - just focus on the backend storage and find out why the caching is disabled.
So please don't shoot this messenger, it just tries to help you!
Update - December 2nd 2013
The SCSI Interface Guide for IBM FlashSystem can be found here.
Time for another piece of my little series! This time I'd like to write about a new feature in v7.0x, especially for administrators and support personnel: the Frame Log. Maybe it's a bit early to write about it, because it seems to be a feature "in development" at the moment, but I have waited for it so long that I'm just not able to resist. I think and hope Brocade will develop it further, like the bottleneckmon - which I was very sceptical about in its first version, released in the v6.3 code. After seeing its functionality extended in v6.4 and even more in v7.0, the bottleneckmon is an absolute must-have.
Hmm... maybe I should write an article about bottleneckmon, too :o)
Back to the Frame Log. So what's that?
Basically it is a list of frame discards. There are several reasons why a switch would have to drop a frame instead of delivering it to the destination device. One of them is a timeout. If a frame sticks in the ASIC (the "brain" behind the port) for half a second, the switch has to assume that something's going wrong and the frame cannot be delivered in time anymore. Then it drops it. Until FabricOS v7.0 the switch just increased a counter by one. Since later v6.2x versions the drop was at least logged against the TX port (the direction towards the reason for the drop) - in earlier versions the counter increased only for the origin port, which made no sense at all. But now we even have a log for it! A log to store all the frames the switch had to discard. While that sounds a bit like rummaging through the switch's trash bin, the Frame Log is very useful for troubleshooting. It contains the exact time, the TX and the RX port (keep in mind the TX is the important one) and even information from the frame itself. In the summary view you see the fibre channel addresses of the source device (SID) and of the destination device (DID).
For example, to see the two most recent frame discards in summary mode, just type:
B48P16G:admin> framelog --show -mode summary -n 2
Fri Sep 23 16:07:13 CET 2011
Log                TX    RX
timestamp          port  port  SID       DID       SFID  DFID  Type     Count
Sep 29 16:02:08    7     5     0x040500  0x013300  1     1     timeout  1
Sep 29 16:04:51    7     1     0x030900  0x013000  1     1     timeout  1
In the so-called "dump mode" you even see the first 64 bytes of each frame. Usually I have to bring an XGIG tracer onsite to catch such information, and often it's not even possible to catch it then, because an XGIG can only trace what's going through the fibre. So you'll only see this frame if you trace a link it crosses before it is dropped. And even then you can't trigger (= stop) the tracer directly on this event; you have to have it looking for a so-called ABTS (abort sequence). If a frame is dropped, the command will time out in the initiator and it will send this ABTS. Depending on what frame exactly was dropped in which direction, the ABTS could appear on the link several minutes after the actual drop of the frame. Imagine a READ command being dropped. The error recovery will start after the SCSI timeout, which could be e.g. 2 minutes. But 2 minutes is a long time in an FC trace. Chances are good that the tracer misses it.
Not so with the Frame Log!
The Frame Log can tell you exactly which frame was dropped. If you try to find out whether a particular I/O timeout in your host was caused by a timeout discard in the fabric, this is your way to go. If you see your storage array complaining about aborts for certain sequences, just look them up in the Frame Log. With this feature Brocade finally catches up with Cisco and their internal tracing capabilities - and Brocade does it in a way much more comfortable for the admin. The logging of discarded frames is enabled by default, and it works on all 8G and 16G platform switches without any additional license.
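To get the frame details, the dump mode can be requested the same way as the summary above. A sketch (please verify the framelog options on your code level):
B48P16G:admin> framelog --show -mode dump -n 2
The output then includes the first 64 bytes of each discarded frame as a hex dump.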
The big "BUTs"
As I mentioned at the beginning of this article, there are still things for Brocade to work on to turn the Frame Log into a must-have tool like the bottleneckmon. The first catch is its volatility. In the current version it can only keep 50 frames per second per ASIC, for 20 minutes in total. At the moment I personally think that's too short. But I'll wait for the first cases where I can use it before I form a final opinion about this limit.
The other - more concerning - constraint is that it only works for discards due to timeout at the moment. So if a frame is dropped because of one of all the other possible reasons, it won't be visible in the Frame Log in its current implementation. But that's exactly what I need! If the switch discards a frame because of a zone mismatch, or because the destination switch was not reachable, or because the target device was temporarily offline or whatever - I want to see that. If a server is misconfigured (uses wrong addresses) and so cannot reach its targets, you'd see the reason right there in the Frame Log - no tracing needed! There are plenty of other situations that would be covered by such a functionality. So I honestly hope that there is a developer with a concept like this in his drawer, or even already implementing it. Allow me to assure you that there is at least one support guy waiting for it...
The picture is from Zsuzsanna Kilian. Thank you!
Brocade recently released its 16G platform switches and, along with them, a new major version of FabricOS: FOS 7.0. Besides the new features customers' admins, architects or end-users might be interested in, I see some nice enhancements and new tools for us support people, too. In the next blog posts I would like to present some of them and show how to use them, why they are important and where they apply.
The first one I want to write about is the D-Port or Diagnostic Port. This is a special mode every port on Brocade's 16G platform can be configured to.
Why should I use it?
Imagine a two-fabric setup, both fabrics spread over two locations, connected via some trunked ISLs through a DWDM. Every once in a while I get a case like this where there was a problem with one of these ISLs. Usually the end-users report major performance problems; there might even be crashes of hosts. The SAN admin looks into his switches, the server admins look for messages against their HBAs, and quickly they notice that the problem seems to be in one fabric only. Having a redundant second fabric available, the decision is made: "Let's block the ISLs in the affected fabric." The workaround is effective, the situation calms down, the business impact disappears. But of course there is no redundancy anymore, and the next step is to find out what happened, so that it can subsequently be resolved.
So a problem case is opened at the technical support. The first request from the support people will be to gather a supportsave. Often they even request to clear the counters and wait some time before gathering the data.
But it's useless now!
Of course it's most important to stop any business impact by implementing a workaround as quickly as possible. But if I get a data collection like this, it's like being asked to heal a disease on the basis of a photo of an already dead person. Usually no customer will allow re-enabling the ISLs before the cause of the problem is found and solved. Welcome to a recursive nightmare! :o)
That's where D-Ports come into play
Having Diagnostic Ports on both sides of the link allows you to test a connection between two switches without having a working ISL. This means there will be no user traffic and also no fabric management over this link, and so there will be no impact at all. From a fabric perspective, the ISL is still blocked. It comes with several automatic tests:
- Electrical loopback - (only with 16G SFP+) tests the ASIC to SFP connection locally
- Optical loopback - (with 16G SFP+ and 10G SFP+) tests the whole connection physically.
- Link traffic test - (with 16G SFP+ and 10G SFP+) does latency and cable length calculation and stress test
So this can even help you to determine the right setup for your long distance connection!
How to do it?
Although it's very easy to set this up in Network Advisor (only supported with 16G SFP+), as a support member I prefer stuff to be done via CLI, because then I can see it in the CLI history. (By the way, a real accounting or audit log covering both CLI and GUI actions would be very useful. I'm looking at you, Brocade!) At first you should know which ports in the two switches correspond to each other. (The Network Advisor would figure that out for you.) Then you disable them on both sides using
portdisable port
Once disabled you can configure the D-Port:
portcfgdport --enable port
And finally enable it again using
portenable port
Of course you would do that on both sides. There's a separate command to view the results then:
B6510_1:admin> portdporttest --show 7
Remote WWNN: 10:00:00:05:33:69:ba:97
Remote port: 25
Start time: Thu Sep 15 02:57:07 2011
End time: Thu Sep 15 02:58:23 2011
Test                 Start time  Result  EST(secs)  Comments
Electrical loopback  02:58:05    PASSED  --         ----------
Optical loopback     02:58:11    PASSED  --         ----------
Link traffic test    02:58:18    PASSED  --         ----------
Roundtrip link latency: 924 nano-seconds
Estimated cable distance: 1 meters
If you see a test failing, you have your culprit, and based on which one is failing, actions can be defined to resolve the problem. Your IBM support will of course help you with that! :o)
So if you face similar problems and you are already using 16G switches with 16G SFP+ installed, feel free to implement a workaround like blocking the ISLs to lower the impact. The D-Port will help to find out the reasons afterwards.
But if you are still on 4G or 8G hardware and you want to disable the most probable guilty ports, then please PLEASE get me a supportsave first!
Better: Clear the counters, wait 10 minutes and then gather a supportsave before you disable the ports. And even better than that: Clear counters periodically as described here.
There is an interesting discussion ongoing in the LinkedIn group The Storage Group. The question is "What is the REAL cost of Fibre Channel?". To my surprise, the participants in this discussion came relatively quickly to the conclusion that the problem is over-provisioning resp. under-utilization. My personal opinion was:
"I would like to come back to the over-provision / under-utilization part. Being a tech support guy, I think a bit different about that. State of the art is 16G FC now but of course I see the majority of customers being on 8G or even 4G. Eventually they will move to higher speeds. Not because all of them really need the higher speed, but it's just the switches and HBAs in sales and marketing at the moment. The "speed race" is driven mostly by the vendors and the customers who really need that line rate. But is it bad for the others? I don't think so. A 16G switch is not really 2x the price of a 8G switch or 4x the price of a 4G. In fact I see the prices sinking on a per port base with increasing functionality on the other hand. And then you stand there with your host X. It has a demand for let's say 200MB/s in total and you connected it to 2 redundant fabrics running with 8G, 1 port per fabric.
That makes: 200MB demand versus 1600MB available. WOW! YOU ARE TOTALLY UNDER-UTILIZED! Shame on you!
Well, not really. Actually it's good to have redundancy. You know that. First of all, "real" redundancy means you are at least 50% under-utilized per se. Add the higher line rate that made no difference in price compared to the lower one, and it's simply normal that you end up over-provisioned and under-utilized.
In fact, things start to get ugly if you really use all your links near 100%. I see that scenario more often recently when customers put VMs on ESX hosts without really knowing their I/O demand. Many of them work fine until the next outage (SFPs _WILL_ break some day, a software bug could crash a switch, etc.), and then you see that you have no real redundancy, because you utilize your links too highly.
On the other hand, many of these ESX hosts with many VMs doing different, unknown workloads tend to turn into slow drain devices as soon as the I/O peaks of certain VMs come together at the same time. Then, at the latest, you notice that under-utilization of a network is not really a bad thing :o)"
Especially the ESX hosts turning into slow drain devices bug me most these days. Nobody really seems to know the demand of their VMs, and the internal statistics of the ESX seem to be very limited in that regard. If you look at the port of a slow drain device, it will most probably still look under-utilized from a bandwidth perspective, because the missing buffers plus the error recovery will keep the plain MB/s numbers down. But in fact the port is exhaustively saturated. And in addition, the eventually dropped frames in the SAN lead to timeouts within the slow-draining host as well. In the end it looks like: "My ESX is far away from utilizing its link completely, but the SAN is bad! We have timeouts!"
So what's the demand?
Some customers have the luxury (should this really be considered a luxury?) of having a VirtualWisdom probe installed to constantly monitor the exact performance values in real time. Archie Hendryx shows some of the things you could see there in practice in his whitepaper "Destroying the Myths surrounding Fibre Channel SAN". But if you don't have such gear and you don't know the demand, it might be worth having an additional ESX host for testing. It doesn't have to be the biggest machine, don't worry. Every day you would take another candidate out of your bulk of VMs with unknown I/O bandwidth (or CPU / memory / etc.) demand and put it on that test server with vMotion. Being relatively unimpaired by other VMs (at least within the ESX), you can measure all the performance values for 24 hours, and - provided no error recovery or external congestion takes place - these are the real demands of that VM. Only based on these demands do you really know which VMs are allowed to come together on the same bare metal. Only then will you have a chance to actually improve the under-utilization in a controlled manner without slamming your SAN into the realms of chaos. The approach seems very simple and straightforward to me, but I see nobody doing this. So what's my error in reasoning, dear reader?
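(For the 24-hour measurement itself, esxtop's batch mode is one way to record the values. A minimal sketch, assuming a classic ESX service console - -b is batch mode, -d the delay between samples in seconds, -n the number of samples, here roughly 24 hours at one-minute intervals:
# esxtop -b -d 60 -n 1440 > vm_demand.csv
The resulting CSV file can then be analyzed offline, for example with Windows perfmon or a spreadsheet.)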
(Thanks to Harout S Hedeshian for the picture.)
Recently I attended a presentation about IBM's cloud computing approaches by IBM Fellow Stefan Pappe. Cloud computing is a big topic in IT nowadays - no doubt about that - but how much impact does it have on SAN troubleshooting? Will the way hardware support is performed change in the cloud? Based on your understanding of the term cloud, you might either say yes or no. In a cloud, IT is just a commodity like water or electrical power. You just use it. You most likely don't want to know how it works as long as its availability is guaranteed. If a component of a server breaks, the whole construct relies on redundancy. Either within the server (multiple paths etc.) or within a pool of servers, where the VMs residing on this particular piece of metal are concurrently moved to other servers. This frees up the broken one for maintenance later on.
For a SAN it's quite similar - we rely on internal redundancy (multiple power supplies, failover-capable control processors and backlink modules) as well as external redundancy (second independent fabric, multiple paths, multiple ISLs), with an important exception: some SAN-related problems have to be troubleshooted "on the open heart". Please don't get me wrong. I don't mean that finding a good workaround isn't important - it surely is, and in most scenarios it's a key element for business continuity. But if the symptoms can't be seen anymore, it might be hard for the support member to do the problem determination.
So what now?
Most of these "workarounded" problems can still be troubleshooted if the SAN is well prepared. Especially part 2 of my How to be prepared blog post can help you with that topic. In addition: please gather a data collection from each and every component in the SAN that is related to the problem before you implement any workaround! For the SAN switches that means: if you have performance problems, for example, please gather a data collection of all SAN switches.
For other problems it might be necessary to actually test the repaired component / modified configuration / improved code in the productive environment to know if it really helped. Of course, all the possible tests that can be done "offline" should be done first. For example, before bringing a formerly toggling ISL back to life, it's better to use the built-in port test capabilities of the switches with loopback plugs.
And as another exception compared with server redundancy: SAN troubleshooting should not be postponed by collecting "workarounded" problems for a certain time and solving them all at once later.
- In most cases redundancy in the SAN means you have two of a kind. Not five or eight or hundreds. So if the core of fabric A fails, it has to be repaired as soon as possible, because a failure of the core in fabric B would then lead to a full outage.
- Different concurrent SAN problems can overlay and create much bigger problems or at least ambiguous symptoms that are much harder to troubleshoot. "Double errors" or "triple errors" are among the worst things to troubleshoot.
- SAN environments are complex structures with lots of hardware and software. There are many things that could prevent redundancy from being utilized properly, such as bugs in multipath drivers, wrong configurations or underestimation of the workload on the redundant paths and components during a problem situation.
So if it can be done now, do it now!
Besides that, there are special requirements of the cloud, such as the ability for multi-tenancy on the SAN components. Cisco has had its VSANs for a long time now, but when it comes to IVR (Inter-VSAN Routing) I sometimes see very strange configurations out there, based on a wrong understanding of the concept. The first attempt of Brocade in that direction were the "Administrative Domains", which came with some very concerning flaws in my opinion. With the v6.2x code stream this concept was virtually replaced by the "Virtual Fabrics" concept. With "base switches", "XISLs" & co, many new possibilities for misconfiguration appeared. Much new stuff to learn for customers, admins, architects and of course support members.
To sum up, I can say that if SAN troubleshooting was done properly before, there won't be much change here. But the cloud boosts the expectations of the users regarding their SAN even more: it should just work! No downtime of the application, ever! Our primary goal is to deal with upcoming problems in a way that prevents any impact on the applications.
Because in the future, zero downtime will no longer be a high-end enterprise feature but a commodity.
If you use a SAN Volume Controller, it usually is the linchpin of your SAN. Except for the FICON and tape related stuff, everything is connected to it. It is the single host for all your storage arrays and the single storage for all your host systems. Because of this crucial role, the SVC has some special requirements regarding your SAN design. The rules can be seen in the manuals or in the SVC infocenter (just search for "SAN fabric"). One of these rules is: "In dual-core designs, zoning must be used to prevent the SAN Volume Controller from using paths that cross between the two core switches."
I made this sketch to illustrate that. As you can see, it's not a complete fabric, but just the devices I want to write about. Sorry for the poor quality, my sketching kung fu is a bit rusty :o)
This is just one of two fabrics. Both SVC nodes are connected to both core switches. The edge switch is connected to both core switches, and besides the SVC business, let's assume there is a host connected to the edge switch using a tape library connected to the cores. There would be other edge switches, more hosts and of course storage arrays as well. Now the rule says that the SVC node ports are only allowed to see each other locally - therefore on the same switch.
So why is that so?
Of course you could say that this is the support statement, and if you want to use a SAN Volume Controller you just have to stick to it. But from time to time I see customers with dual-core fabrics who don't follow that rule. Initially, when the SVC was integrated into the fabric, the rule was followed, because the integration was most probably done by a business partner or an IBM architect according to the rules and best practices. But later - after months or years, maybe the SAN admin even changed - new hosts were put into the fabric, and with an initiator-based zoning approach, each adapter was zoned to all its SVC ports in the fabric. Et voilà! The rule is infringed. The SVC node ports see each other over the edge switch again, and the inter-node traffic passes 2 ISLs instead of none.
What is inter-node communication?
Besides the mirroring of the write cache within an I/O group, there is a mechanism to keep the cluster state alive. It includes a so-called lease, which has to pass all nodes of a cluster (up to 8 nodes in 4 I/O groups) within a certain time to ensure that communication is possible. These lease cycles start again and again, and they even overlap, so if one lease is dropped somehow and the next cycle finishes in time, everything is still fine. The lease frames will be passed from node to node within the cluster several times. But if there are severe problems in the SAN, the cluster has to trigger the necessary actions to keep the majority of the nodes alive. Such an action would be to warm-start the least responsive node or subset of nodes. You will read "Lease Expiry" in your error log. In a worst-case scenario, where the traffic is impacted to a degree that inter-node communication is not possible at all, it might happen that all nodes do a reboot - and if the impact stays in the SAN, they might do that again and won't be able to serve the hosts.
The result - BIG TROUBLE!
Just as a small disclaimer to prevent FUD (Fear, Uncertainty and Doubt): this is not a design weakness of the SVC or anything like that. All devices in a SAN are vulnerable to the risk I want to describe. In addition, from all the error handling behavior of the SVC as I know it, the SVC seems designed to rather allow a loss of access than to allow data corruption. It is still the last resort, but it's better than actually losing data.
Back to the dual-core design. The following sketch shows that with the wrong zoning, the lease could take the detour over the edge switch instead of going directly from node 1 to node 2 via core 1 or core 2. It would pass 2 ISLs.
Why should I care?
There are several technical reasons why ISLs should be avoided for that kind of traffic, but from a SAN support point of view I consider this the most important one: slow drain devices! Imagine that one day the host acts as a slow drain device for whatever reason. The tape would send its frames to the host, passing the cores and the edge switch. As the host is not able to cope with the incoming frames now, it would not free up its internal buffers in a timely manner and would not send permissions to send more frames (R_RDYs) to the switch quickly enough. The frames pile up in the edge switch and congest its buffers. The congestion back-pressures to the cores and finally to the tape drive. As the frames wait within the ASICs, some of them will eventually hit the ASIC hold time of 500ms and get dropped. This causes error recovery, and depending on the intensity of the slow drain behavior, it can kill the tape job. Bad enough?
But hey! The SVC needs these ISLs!
And that's where it gets ugly. In the sketch above, the ISL between core 1 and the edge switch will become a bottleneck not only for the tape-related traffic but for the SVC inter-node communication as well. It will not only cause performance problems (due to the disturbed write cache mirroring) but could also lead to the situation that the frames of several SVC lease cycles in a row are delayed massively or even dropped, causing lease expiries resulting in node reboots.
That's why keeping an eye on the proper zoning for the SVC is so important and that's the reason for that rule.
Just a short anecdote related to that: some years ago I had a customer with a large cluster where not the drop of leases but their massive delay caused the problem. As every single pass of the lease from one node to the next was still just within the time-out values, the subset of nodes that was really impaired by the congestion saw no reason to back out and reboot. But as the overall time-out for the lease cycles was reached at a certain point in time, the wrong (because healthy) nodes rebooted, and the impaired ones were kept alive. Not so good... As far as I know, some changes were made in the SVC code later to improve its error handling in such situations, but the rule is as valid as ever:
Avoid inter-node traffic across ISLs!
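To illustrate what "local only" means in zoning terms, an inter-node zone confined to core 1 could look like this on the CLI (all aliases, WWPNs and config names here are hypothetical; the point is that only node ports connected to the same switch end up in one zone):
core1:admin> alicreate "SVC_N1_P1_core1", "50:05:07:68:01:40:aa:01"
core1:admin> alicreate "SVC_N2_P1_core1", "50:05:07:68:01:40:bb:01"
core1:admin> zonecreate "Z_SVC_internode_core1", "SVC_N1_P1_core1; SVC_N2_P1_core1"
core1:admin> cfgadd "PROD_CFG", "Z_SVC_internode_core1"
core1:admin> cfgenable "PROD_CFG"
The same would be done on core 2 for the node ports connected there - but no zone ever combines node ports from core 1 with node ports from core 2.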
Two additional topics for my previous post came to my mind, and I doubt they will be the last ones :o)
Have a proper SAN management infrastructure
For most of you it's self-evident to have a proper SAN management infrastructure, but from time to time I see environments where this is not the case. In some it's explained by security policies ("Wait - you are not allowed to have your switches in a LAN? And the USB port of your PC is sealed? You have no internet access? No, I don't think that you should send a fax with the supportshow..."), sometimes it's just economizing on the wrong end. And sometimes there is simply no overall plan for SAN management. So I think at least the following things should be in place to enable timely support:
- A management LAN with enough free ports to allow integration of support-related devices. For example a Fibre Channel tracer.
- A host in the management LAN which is accessible from your desk (e.g. via VNC or MS RDP) and has access to the management interfaces of all SAN devices. This host should boot from an internal disk rather than out of the SAN.
- A good ssh and telnet tool should be installed which allows you to log the printable output of a session into a text file. I personally like PuTTY.
- A TFTP and an FTP server on the host mentioned above. They can be used for supportsaves, config backups, firmware updates etc. They should always be running, and where possible, the devices should be pre-configured to use them (e.g. with supportftp on Brocade switches; see the example after this list).
- If it's possible with your security policy, it's helpful to have Wireshark installed on it, which can be used for "fcanalyzer" traces on Cisco switches or to trace the ethernet if you have management connection problems with your SAN products.
- The internet connection needs enough upload bandwidth. Fibre Channel traces can be several gigabytes in size. When time matters, undersized internet connections are a [insert politically correct synonym for PITA here :o)]
- Callhome and remote support connection where applicable. Callhome can save you a lot of time in problem situations. No need to call support and open a case manually - the support will call you. And most of the SAN devices will submit enough information about the error to give the support member at least an idea where to start and which steps to take first. So in some situations callhomes trigger troubleshooting before your users even notice a problem. In addition, some machines (like the DS8000) allow the support to dial into them and gather the support data directly - and only the support data. Don't worry - your user data is safe!
- Have all passwords at hand. This includes the root passwords as some troubleshooting actions can only be done with a root user.
- Have all cables and at least one loopback plug at hand. By cables I mean at least: one serial cable, one null-modem cable, one ethernet patch cable and one ethernet crossover cable (not all devices have auto-negotiating GigE interfaces)... better more. And of course a good stock of FC cables should be onsite as well.
- The NTP servers as mentioned in my previous blog post.
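For the supportftp pre-configuration mentioned above, the setup could look like this (a sketch; IP address, user and directory are examples - check supportftp in the command reference for your code level):
myswitch:admin> supportftp -s -h 10.1.2.3 -u ftpuser -p ftppassword -d /supportsaves
Once this is set, data collections can be pushed to the FTP server without typing in the parameters every time.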
Monitoring, counter resets and automatic DC
Besides any SAN monitoring you hopefully do already (Cisco Fabric Manager / Brocade DCFM / Network Advisor / Fabric Watch / SNMP traps / syslog server / etc.), there is one thing in addition: automatic data collections based on cleared counters. Finding physical problems on links, frame corruption on SAN director backlinks, slow drain devices or toggling ports - for all these problems it helps a lot if you can 1. do problem determination based on counters cleared on a regular basis and 2. look back in time to see exactly when it started and maybe how the problem "evolved" over time.
What you need is some scripting skills and a host in the management LAN (with an FTP server) to run scripts from, as mentioned above. A good practice is to look for a good time slot - better not at workload peak times - and set up a timed script (e.g. a cron job) that does the following (see the sketch after this list):
- Gather data collections of all switches - use "supportsave" for Brocade switches and for Cisco switches log the output of a "show tech-support details" into a text file.
- Reset the counters - use both "slotstatsclear" and "statsclear" for Brocade switches, and for Cisco switches run both "clear counters interface all" and "debug system internal clear-counters all". The debug command is a hidden one, so please type it in completely, as auto-completion won't work. The supportsave is already compressed, but for the Cisco data collection it might be a good idea to compress it with the tool of your choice afterwards.
Additional hint: use proper names for the Cisco data collections. They should at least contain the switch name, the date and the time!
Depending on the disk space and the number of switches, it may be good to delete old data collections after a while. For example, you could keep one full week of data collections and for older ones only keep one per week as a reference.
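A minimal sketch of such a cron script for the Brocade side (assuming key-based ssh logins for the admin user and supportftp parameters already configured on each switch, as described in my previous post; all hostnames are examples):

#!/bin/sh
# nightly_dc.sh - gather data collections, then reset the counters
for SWITCH in san_sw01 san_sw02 san_sw03
do
    # 1. gather the data collection; -n suppresses the confirmation prompt,
    #    -c re-uses the parameters configured via supportftp
    ssh admin@$SWITCH "supportsave -n -c"
    # 2. reset the counters afterwards
    ssh admin@$SWITCH "slotstatsclear"
    ssh admin@$SWITCH "statsclear"
done

For the Cisco switches, the same loop would log the output of "show tech-support details" into a properly named text file instead.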
If you have a good idea in addition how to be best prepared for the next problem case, please let me know. :o)
To be honest, the title of this article could also be "How to ease the life of your technical support". But in fact it will ease the life of everyone involved in a problem case, and priority #1 is to solve upcoming problems as quickly as possible.
In the article The EDANT pattern I explained a structured way to transport a problem properly to your SAN support representative. In addition it might be a good idea to prepare the SAN for any upcoming troubleshooting.
The following suggestions are born out of practical experience. They are intended to help you get rid of all the obstacles and showstoppers that could disturb or delay the troubleshooting process right from the start. Please treat them as well-intentioned recommendations, not as pesky "musts". :o)
Synchronize the time
Having the same time on all components in the datacenter is a huge help during problem determination. Most of the devices today support the NTP protocol. So the best practice is to have an NTP server (+ one or two additional ones for redundancy) in the management LAN and to configure all devices (hosts, switches, storage arrays, etc.) to use them. It's not necessary to have the NTP server connected to an atomic clock. The crucial thing is to have a common time base.
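On the SAN switches this is quickly done. A sketch with example server addresses (on Brocade the setting is distributed within the fabric; on Cisco it's part of the running config):
brocade_sw:admin> tsclockserver "10.0.0.11;10.0.0.12"
cisco_sw(config)# ntp server 10.0.0.11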
Have a troubleshooting-friendly SAN layout
What is a troubleshooting-friendly SAN layout? I don't only mean that it's a good idea to always have an up-to-date SAN layout sketch at hand - which is very helpful in any case. What I mean is to have a SAN design that lacks any artificial obscurities. If you have 2 redundant fabrics (yes, there are still environments out there where this is not the case), it's best practice to connect all the devices symmetrically. So if you connect a host on port 23 of a switch in one fabric, please connect its other HBA to port 23 of the counterpart switch in the redundant fabric.
Use proper names
It may sound laughable, but bad naming can do a lot of harm. I think four points are important here:
- The naming convention - It may be funny to have server names like "Elmo", "Obi-Wan" or "Klingon", but for troubleshooting it's better to have some useful info within the name - something like BC01_Bl12_ESX, for example (for BladeCenter 1, blade 12, OS is ESX).
- Naming consistency - It's even more important to actually use the same name for the same item. It's very helpful if, for example, a host has the same name in the switch's zoning, in the storage array's LUN mapping and on the host itself.
- Unique domain IDs - The domain ID is like the ZIP code of a switch, and according to the fibre channel rules it has to be unique within a fabric. In addition it is very helpful to keep it unique across fabrics as well. Domain IDs are used to build the fibre channel address of a device port - the address used in each frame. Within the error logs of the connected devices (hosts, storages, etc.) these fibre channel addresses are often the only references to the SAN components. Being able to tell at any time which paths over exactly which switch are affected is priceless. (A small decoding example follows after this list.)
- Brocade: chassisname - As Virtual Fabrics become more and more of a standard in Brocade SANs, it's crucial to set the chassisname, because the switchname is bound to the logical switch, not to the box. These chassisnames are used for naming the data collections (supportsaves), and if you don't configure them, the device type will be used instead. So you'll most probably end up with a huge collection of supportsave files that differ only in the date. The chassisname can easily be set with the command "chassisname". That's one small step for... :o)
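As a small illustration of why unique domain IDs pay off: the 24-bit fibre channel address you find in host or storage error logs is simply domain, area and port concatenated. With unique domain IDs, a quick bit of arithmetic immediately names the switch - a toy example:

# Split a 24-bit fibre channel address into domain / area / port.
# With unique domain IDs the first byte alone tells you which switch
# (and therefore which path) an error log entry refers to.
def decode_fcid(fcid):
    domain = (fcid >> 16) & 0xFF  # the switch's domain ID
    area = (fcid >> 8) & 0xFF     # commonly maps to the switch port
    port = fcid & 0xFF            # AL_PA / NPIV index, often 0x00
    return domain, area, port

print(decode_fcid(0x0A1700))  # -> (10, 23, 0): domain 10, port area 23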
Use change management
I can't emphasize this enough: Please use change management. Even for the smallest SAN environment, where you would think "Nah! That's my little SAN, I can keep all the stuff in my head." Even for the biggest SAN environment, where you would think "Nah! Too many people from too many departments are involved here. The SAN is living and evolving every day." Besides any internal policy or external requirement (change management is mandatory in several industries), proper change management also helps in the troubleshooting process. If you can come up with a complete time plan of all actions done in the SAN, and with the assertion that no unplanned maintenance actions are done during the problem determination, you will have a very happy SAN support member :o)
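By the way, it doesn't need fancy tooling. Even a flat file with one time-stamped line per action is infinitely better than nothing - a made-up example (all names, systems and ticket numbers invented):

2012-01-16 09:12  jdoe  fabA_sw01   added zone BC01_Bl12_ESX__DS8K_I0101 (change CH-4711)
2012-01-16 09:20  jdoe  fabB_sw01   added the symmetric zone in fabric B (change CH-4711)
2012-01-17 22:05  ops   fabA_mds01  code update to a new maintenance release (change CH-4719)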
Back up your configuration
Bad things can happen every day. Things that wipe parts or all of your switches' configuration, or even worse, turn them into useless doorstops. It's not likely to happen, but if and when it does, you'd better be prepared. To be up and running again as soon as possible, you should not only back up your user data but also your configurations on a regular basis. For Brocade switches use "configupload", and for Cisco switches copy the running-config to an external server. The SAN Volume Controller (SVC) and the Storwize V7000 have options to back up the configuration in their GUIs as well. Besides that, it helps a lot to also store all the license information for your switches in a well-known place. At least for the SAN switches, IBM cannot generate licenses and there's also no "emergency stock" of licenses. The support would have to open a ticket at the manufacturer and clarify the license issue with them. This can cost precious time in problem situations.
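The configuration backup can live in the same cron setup as the data collections above. "configupload" is the canonical way on Brocade; as a generic sketch you can also simply capture the output of "configshow" (Brocade) and "show running-config" (Cisco) into dated files - again assuming SSH key authentication and placeholder names:

#!/usr/bin/env python
# Sketch of a weekly config backup. Capturing command output is a simple,
# generic complement to configupload; hostnames and paths are placeholders.
import datetime
import subprocess

BACKUP_COMMANDS = {
    "fabA_sw01": "configshow",            # Brocade
    "fabA_mds01": "show running-config",  # Cisco
}
STAMP = datetime.date.today().isoformat()

for switch, command in BACKUP_COMMANDS.items():
    with open("/srv/sanconfigs/%s_%s.txt" % (switch, STAMP), "w") as f:
        subprocess.call(["ssh", "admin@" + switch, command], stdout=f)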
Keep your firmware up-to-date
This advice often smacks of a shot from the hip, something like "Did you reboot your PC?" in PC tech support. But to be fair, it's not just the SAN support member's blanket mantra. No software is absolutely bug free, and because of that there are patches or - in the SAN world - more likely maintenance releases. Often there are parallel code streams: newer ones with more features but a higher risk of new bugs, and on the other hand older ones with a long history of fixed defects and a "comfortable" level of stability, but most probably already with an "End of Availability" in sight. Between these two extremes are the mature codes, like the v6.3.x code stream for Brocade switches. It doesn't have the latest features, but it has a good amount of "installed hours" all over the world. It is still fully supported, so if you really ran into a new bug, Brocade would write a fix for it. It's essentially the same for Cisco and for our virtualization products.
So it's up to you. If you want the new features, you have to use the latest code. If you don't need them at the moment, the latest version of a mature code stream might be better for you. Of course you have to align these considerations with the recommended or required versions of the connected devices, as some really do require a specific version. A best practice is to update the switches, and if possible all attached devices as well, proactively twice a year - besides any additional recommended updates due to problem cases where a particular bug has to be fixed. If you need support with all the planning and doing, please contact your local IBM sales rep for an offering called Total Microcode Support. These guys will check the SAN environment, including the attached devices, for their firmware and will come up with a consistent, cross-checked list of recommended versions. Another view on the topic comes from Australian IBMer Anthony Vandewerdt in his Aussie Storage Blog.
Think about your features
Speaking of code updates and features, it's of course a good idea to actually read the release notes. They contain crucial information about the version and should also explain new features. The crux of the matter is that there can be new features that you don't actually need, and some of them will be enabled by default. One example is the Brocade feature "Quality of Service" (short: QoS). In simple terms it "partitions" the ISLs to grant high-priority traffic a kind of "right of way" over medium- and low-priority traffic. Buffer-to-buffer credits are reserved for the different priority levels to enable this. But to really use it, you have to decide which traffic falls into which category. You do this with so-called QoS zones. If you don't configure the zones but leave QoS enabled, all traffic is categorized as medium priority and you don't use the resources reserved for high and low priority. In times of high workload this can end up as an artificial bottleneck, resulting in frame drops, error recovery and performance problems. This is only one example showing that it's better to be aware of which additional features are activated and whether you really need them.
Know the support pages
IBM, like other vendors, has a comprehensive "Support" section on its website. It offers loads of information, manuals, links to code downloads, technotes and flashes. It's also possible to open and track a support case there via the web. With all the stuff on these pages and all the products IBM offers support for, you might get lost a bit. Our "IBM Electronic Support" team (@ibm_eSupport) is constantly optimizing these pages, but hint number one is: register for an account and set up these pages the way you like them. Then you have your products at hand and you'll find all related information easily. And if you have some spare time (do you ever?), just have a look around on the support pages. There might be useful hints or important flashes concerning your IBM products.
As always, this "list" isn't exhaustive and you probably have additional things in place to be prepared for problem determination. Feel free to share them in the comments below. Thank you!
One of the ugliest things that can happen in a SAN is a big performance problem introduced by a slow drain device (or slow draining device). Why is it so ugly? Well, if a full fabric or a full data center goes down - due to a fire, for example - that's definitely ugly, too. But such situations can be covered by redundancy (failover to another fabric, to another data center, etc.), because the trigger is very clear. A performance degradation due to a slow drain device, however, is not so obvious - at least not for most hosts, operators or automatic failover mechanisms. Frames are dropped seemingly at random, paths fail but seem to work again with the next TUR (Test Unit Ready), just to fail again minutes later. Error recovery hits the performance, and the worst thing: if commonly used resources are affected - like ISLs - the performance of totally unrelated applications (running on different hosts, using different storage) is impaired.
So you have a slow drain device. If you have a Brocade SAN, you might have found it by using the bottleneckmon or you noticed frame discards due to timeouts on the TX side of a device port. If you have a Cisco SAN, you probably used the creditmon or found dropped packets in the appropriate ASICs. Or maybe your SAN support told you where it is. In any case, let's imagine the culprit of a fabric-wide congestion is already identified. But what now?
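Before we get to the "what now", a short aside on the "finding it" part: on the Brocade side, one counter worth keeping an eye on is tim_txcrd_z from "portstatsshow" - it accumulates the time a port could not transmit because it was out of buffer credits, and a steadily climbing value on an F-Port is a classic slow drain symptom. A small polling sketch (the switch name and port are placeholders, and the exact output format may vary between FOS versions):

#!/usr/bin/env python
# Sketch: poll the "time at zero TX credit" counter of a suspect port once
# a minute. A steadily climbing tim_txcrd_z points to a device that does
# not return credits fast enough. Switch name and port are placeholders.
import subprocess
import time

while True:
    output = subprocess.check_output(["ssh", "admin@fabA_sw01", "portstatsshow 23"],
                                     universal_newlines=True)
    for line in output.splitlines():
        if "tim_txcrd_z" in line:
            print("%s  %s" % (time.strftime("%Y-%m-%d %H:%M:%S"), line.strip()))
    time.sleep(60)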
The following checklist should help you to think about why a certain device behaves like a slow drain device and what you can do about it. I don't claim this list to be exhaustive and some of the checks may sound obvious, but that's the fate of all checklists :o)
Check the firmware of the device:
- Is this the latest supported HBA firmware?
- Are the drivers / filesets up-to-date and matching?
- Any newer multipath driver out there?
- Check the release notes of all available firmware / driver versions for keywords like "performance", "buffer credits", "credit management" and of course "slow drain" and "slow draining".
- If you found a bugfix in a newer and supported version, giving it a try is worthwhile.
- If you found a bugfix in a newer but unsupported version, get in contact with the support teams of the connected devices to get it supported or to find out when it will be supported.
Check the configuration and the workload:
- Is it configured according to available best practices? (For IBM products, often a Redbook is available.)
- Is the speed setting of the host port lower than that of the storage and the switches? Better to have them at the same line rate.
- Queue depth - would decreasing it help, to have fewer concurrent I/Os?
- Is the load balanced over the available paths? Check your multipath policies!
- Check the number of buffers. Can it be modified? (The direction depends on the type of the problem.)
Check the concept:
- Do you have a device with just too much workload? A virtualized host with too many VMs sharing the same resources? Better separate them.
- Too much workload at the same time? Jobs starting concurrently? Better distribute them over time.
- Multi-type virtualized traffic over the same HBA? One VM with tape access sharing a port with another one doing disk access? Sequential I/O and very small frame sizes on the same HBA? Maybe not the best choice.
Check the device's logs for any incoming physical errors. Of course, error recovery slows down frame processing.
Check the switch port for any physical errors. If you have bit errors on the link, the switch may miss R_RDY primitives (responsible for increasing the sender's buffer credit counter again after the recipient has processed a frame and freed up a buffer). The little credit model after this checklist illustrates the mechanism.
Use granular zoning (single-initiator zoning, better yet 1:1 zones) to minimize the impact of RSCNs. (A device that has to query the name server again and again has less time to process frames.)
If all else fails, look for "external" tools and workarounds:
- If the slow drain device is an initiator, does it communicate with too many targets? (Fan-out problem)
- If the slow drain device is a target, is it queried by too many initiators? (Fan-in problem)
- Is it possible to add more HBAs / FC adapters? Maybe on other buses?
- Is the device connected as an L-Port but capable of being an F-Port? Configure it as an F-Port, because the credit management of L-Ports tends to be more vulnerable to slow drain behavior.
- Does the slow drain host get its storage from an SVC or Storwize V7000? Use throttling for this host. Other storage systems may have similar features.
- Brocade features like Traffic Isolation Zones, QoS and Trunking can help to cushion the impact of slow drain devices.
- Have a Brocade fabric with an Adaptive Networking license? Give Ingress Rate Limiting a try.
- Last resort: Use port fencing or an automated script to kick marauding ports out of the SAN.
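To illustrate the credit mechanism mentioned in the checklist, here is a toy model of buffer-to-buffer flow control (all numbers are invented and have no relation to real credit or frame counts): a sender may only transmit while it holds credits, and every processed frame returns one credit via R_RDY. A device that drains slowly pins the credit count near zero, and everything queued behind it has to wait.

# Toy model of buffer-to-buffer flow control (all numbers invented).
def simulate(credits, frames, drain_per_tick, send_per_tick=4):
    queued, in_flight, ticks = frames, 0, 0
    while queued or in_flight:
        ticks += 1
        sent = min(send_per_tick, credits, queued)  # no credit, no frame
        credits -= sent
        queued -= sent
        in_flight += sent
        done = min(drain_per_tick, in_flight)  # frames the device processes
        in_flight -= done
        credits += done                        # one R_RDY per freed buffer
    return ticks

# a healthy device drains as fast as we send...
print(simulate(credits=8, frames=100, drain_per_tick=4))  # -> 25 ticks
# ...a slow drain device makes the same workload take four times as long
print(simulate(credits=8, frames=100, drain_per_tick=1))  # -> 100 ticks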
The list above is just a collection of things I have already seen in problem cases. Having said this, it might be updated in the future if I encounter more reasons for slow drain device behavior. Of course I'm very interested in your opinion and in more reasons, or ways to deal with them!
From time to time (sometimes every day - the support business is a capricious one) I need to see what's really going on in the fibre. For that reason we have a couple of tracers that can be shipped to the EMEA countries. Some IBM organizations in some countries even have their own tracers. For SAN support we use the XGIGs from JDSU (originally from Finisar). Usually I trace if the problem is somehow protocol related and cannot be solved with the RAS packages of the switches and the devices. Or if the RAS information from one device contradicts that of another. Or if every support team (internal and external) points at the others. Or if something totally strange happens and nobody can deal with it. Maybe we trace a little too often, because meanwhile other vendors sometimes say things like "Oh, you also have IBM gear in your environment? Let them trace it!".
So what's this tracing all about?
To put it simply, you connect it inline and it just records all the traffic. Of course you can filter it and let it trace only the interesting parts of the frames. I don't care about the actual data, but the FCP and SCSI header info is precious information. Of course an 8 Gbps link generates a lot of data, too, and the memory is very limited. So you want to be sure to trace exactly what you need - not more, not less. The tracing is done by IBM customer engineers. We make sure to have a suitable number of trained CEs in every region; I have hosted some of the trainings myself and imho it's definitely worth it. The analysis is done afterwards. I personally like it, because it gives me the possibility not to be "bound" to the RAS packages alone. I can really see what happens.
Although the whole topic is pretty straightforward, to those unfamiliar with it tracers seem to be mystical devices. Over time I've faced several "urban legends" that sometimes impede troubleshooting a lot:
- "What info? You should see that in the trace!" - Often I get no additional information for a trace (e.g. one consisting of 8 trace files from different channels), which slows down the analysis extremely. I need at least a layout showing where exactly the tracer was connected. I need to know how it was configured, whether the problem really happened during the trace, and I need the data collections of the switch and the devices to compare what I see against the RAS packages. Please help me to help you! :o)
- "We can't put this link down. Is it important where to plug in the tracer?" - Yes, of course it is. As described above, it just records the traffic that enters the tracer. Nothing more. There are no tiny little photon-based nano robots swarming out through the fibres and collecting data. Really. If you plug it in somewhere else, I won't see the problem.
- "Thank you so much for introducing a tracer into our environment. It solved the problem. It has to stay." - No, the tracer did not solve the problem by itself. If the problem somehow vanished after cabling in the tracer, then a simple portdisable/portenable would most likely have helped as well. The tracers are needed frequently and can't stay in the environment till the end of days.
These were just some of the rumors and statements I've heard in the past. To summarize, please keep in mind:
A tracer is not a magical device. It just records traffic.
If you work in technical support, generally speaking your job is to fix what's broken. But working in SAN support is, most of the time, about solving complex problems. The SAN connects everything with everything in the storage world, and often that's a lot. Oh yes, there are well-planned and "troubleshooting-friendly" environments out there, managed by top-skilled administrators using state-of-the-art tools, with enough time between daily routine and important projects to spot problems before they even have an impact on the applications. At least I believe such environments exist, but most of the time I don't even get to see a part of that. There are excellent multi-tenancy capable products out there maintained by a single part-time admin, or by an operator some thousand miles away monitoring the environments of a dozen clients. And when there is a problem, this poor guy is called by all the angry people relying on a working IT, up to the C-levels. Then he opens a case with his SAN vendor.
Let's switch to the support guy. He takes the new case and reads: "Massive problem, SCSI error!". Yes, most of the time there is just a statement like this. That's okay for the beginning, because the so-called "Request Receipt Center" just creates cases administratively (OMG, is that even a word in the English language?). The first level of support, the so-called Frontend, will then call you and ask you about the problem. And they will (hopefully) bring the information into a pattern called "EDANT", to have it in a structured way and to be able to hand it over to others (horizontally for shift changes or vertically for escalation). This first call (sometimes 2..n calls) is crucial, because the most important thing is to actually understand the problem. That sounds trivial, but it's not. In fact the whole problem determination will fail, or at least lag significantly, if this set of information is incomplete or contains false statements.
I know you will be under pressure. I know you have a thousand other things to do. I know some sales guy probably promised you "Our excellent support will solve all problems - if there ever were one - just by hearing the tone of your voice for 1.4 seconds!". But again, enabling the support guy to actually understand your problem is the most important thing, and you can hugely accelerate that process by preparing the information using the EDANT pattern.
So what's this EDANT pattern exactly? I have to admit, we stole it from the software guys. You will notice that by the wording. EDANT means:
E is for Environment. You (hopefully) know your environment, and maybe you have described it to IBMers several times before; maybe an IBM architect even designed it. But to be honest, IBMers don't share a collective consciousness like the Borg :o), and on the other hand, things change. So what's needed is a good description of the environment related to the current problem. This includes, among others:
- A layout with the related switches and devices and the ports used to connect them.
- The machine/model information of related switches, hosts, storages, etc
- The firmware/OS/driver levels of all components.
- Time gaps between the components. (Better use NTP!)
- If you use SAN extenders, describe them. Do you use CWDM/DWDM/TDM? How long are the links? Type? Vendor? Cards? Versions? Transparency? Do you use FCIP? Bandwidth? Quality?
- Additional specialities: Is there any interop stuff going on? Is this a test SAN? Is this pre-production? Is this designed without redundancy? Stuff like this...
D is for Description. Please describe your problem as precisely and as comprehensively as possible.
- When did it start?
- What happened?
- Where can you notice it?
- What do the switches report?
- What do the other devices report?
- What was done when the problem happened?
- What is the impact?
And in regard to the environment, please ask yourself: Which components are affected? Which components could be affected but are not? What is the difference between them? Questions like these are the key to narrowing down the problem.
A is for Actions Done. Opening a case is most probably not the first thing you do when the phones begin to ring. When a case reaches me, "someone" has already done "something". Maybe you have a plan for situations like this. Maybe someone demanded "Do things!". Maybe you switched off some "culprit candidates". All this should be documented as accurately as possible. With time stamps! And of course with results. Everything that has changed in the environment since the problem occurred is worth mentioning, including counter resets. Do as much as possible from the CLI (Command Line Interface) and use session logging. Precious!
N is for Next Actions. This section is for everything you already have planned (maintenance windows, replacements, recovery actions, internal and external deadlines) and for everything you expect from the support. The second point is not trivial either. Of course you want the support to solve the problem. But what is most important? Do you need a workaround first, to get things working again? Do you need an RCA (Root Cause Analysis) by the next day? Does the problem have to be solved overnight, with a contact person available to provide data and further info? State your expectations to get the right help.
T is for Test Case. Okay, this one is clearly from the software support. It's the data collections and any additional data and descriptions of it, like the session logs mentioned above. Screenshots, performance data or scripts belong here too. Usually the support offers a way to upload all the stuff. Please be aware that IBM, for example, doesn't keep data collections from cases till the end of days. So if you uploaded something for another, already closed case 6 months ago, it's most probably gone.
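If it helps, here is a bare-bones skeleton of how such a structured first problem description could look (every detail below is a made-up placeholder, of course):

E - Environment: two redundant fabrics fabA/fabB, layout sketch attached; switch models and firmware levels listed; NTP in use; no extenders.
D - Description: since 2012-01-16 ~22:00, hosts X and Y lose paths to storage Z; the switches log CRC errors on port 23; impact: batch jobs delayed.
A - Actions done: 22:30 counters cleared; 23:10 SFP on port 23 replaced; session logs attached. No other changes since then.
N - Next actions: maintenance window on Saturday; expectation: workaround first, RCA afterwards.
T - Test case: supportsaves of all four switches, host logs and screenshots uploaded to the case.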
Using this pattern to structure the information should avoid any communication-based delays. It may sound like a lot of stuff in the beginning, but it's definitely worth it.