I was asked where to look in a switch to find the average frame size for a port. The safest way would be to use an external monitoring tool like VirtualWisdom or a tracer as described in my LD mode article, but if you don't own something like that, you can get a good estimate from the switches themselves. You just have to calculate it from the number of frames and the number of bytes transferred.
For Cisco it's easy. Just look into the "show interface" output for the specific port and you'll find both numbers in the statistics section of each interface:
1887012 frames input, 1300631486 bytes
542470 frames output, 482780325 bytes
So we can just calculate the average frame sizes for both directions:
1300631486 bytes / 1887012 frames = 689 bytes per frame
482780325 bytes / 542470 frames = 890 bytes per frame
For Brocade switches you can get the information out of the portstatsshow command:
stat_wtx 35481072 4-byte words transmitted
stat_wrx 70173758 4-byte words received
stat_ftx 1111087 Frames transmitted
stat_frx 1177665 Frames received
Here we don't get plain bytes but 4-byte words. Don't worry - fillwords are not included in this number, so it's still valid for our calculation. We just have to multiply by four to use it:
(35481072 * 4) bytes / 1111087 frames = 128 bytes per frame
(70173758 * 4) bytes / 1177665 frames = 238 bytes per frame
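If you have to do this more often, a few lines of script spare you the mental arithmetic. Here is a minimal Python sketch (the function name is mine, the numbers are the ones from the examples above):

# Average frame size from switch counters - works for Cisco (plain bytes)
# and for Brocade (4-byte words).
def avg_frame_size(frames, byte_count=None, word_count=None):
    if byte_count is None:
        byte_count = word_count * 4  # Brocade counts 4-byte words
    return byte_count / frames

print(avg_frame_size(1887012, byte_count=1300631486))  # Cisco input: ~689
print(avg_frame_size(1111087, word_count=35481072))    # Brocade tx:  ~128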
Is it really that easy?
Basically yes. With this average frame size you can find out the multiplier for the buffer credit settings. So if you have an average frame size of 520 bytes and a link of 30 km, just calculate:
2112 (the max frame size) / 520 ≈ 4
So you would set up the link for 120 km instead of 30 km to reserve a sufficient amount of buffers. That's it.
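As a little sketch in Python (2112 as the maximum frame size and plain rounding, just like in the example above):

MAX_FRAME = 2112  # maximum FC frame size in bytes

def distance_to_configure(real_km, avg_frame_bytes):
    multiplier = round(MAX_FRAME / avg_frame_bytes)  # 2112 / 520 -> 4
    return multiplier * real_km

print(distance_to_configure(30, 520))  # -> 120 (km)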
One last catch
If you read my article about bottleneckmon you probably already know that we work with 32 bit counters here. While they cover a few hours for the frames, they wrap much quicker for the 4-byte words. So to be able to calculate an average frame size over several hours or days, 32 bit counters are not enough. Actually there are 64 bit counters for these values in the switches - although they are not part of a supportsave. The command portstats64show provides them. One thing to keep in mind: while in the latest FabricOS versions a statsclear resets these counters as well, in older versions you had to reset them with portstatsclear.
The 64 bit counters are actually two 32 bit counters, and the lower one ("bottom_int") is the 32 bit counter we used all the time in portstatsshow. Each time it wraps, it increases the upper one ("top_int") by 1. So after a while you might see a portstats64show output like this:
stat64_wtx 0 top_int : 4-byte words transmitted
2308091032 bottom_int : 4-byte words transmitted
stat64_wrx 39 top_int : 4-byte words received
1398223743 bottom_int : 4-byte words received
stat64_ftx 0 top_int : Frames transmitted
9567522 bottom_int : Frames transmitted
stat64_frx 0 top_int : Frames received
745125912 bottom_int : Frames received
For the received frames it's then:
(2^32 * 39 + 1398223743) * 4 bytes / 745125912 frames = 907 bytes per frame.
Much manual computing, hmm?
Of course you could write a script for that or prepare a spreadsheet, but my recommendation is still to start with a multiplier of 3 for normal open systems traffic and check with the command portbuffershow how many buffers are still available. And if you still have some, use them - but keep them in mind if you connect additional long distance ISLs or other devices you want to give additional buffers.
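Such a script doesn't need much. A minimal Python sketch of the counter arithmetic (the function name is mine, the values are the ones from the output above):

def combine64(top_int, bottom_int):
    # top_int counts the wraps of the lower 32 bit counter
    return top_int * 2**32 + bottom_int

words_rx = combine64(39, 1398223743)  # stat64_wrx
frames_rx = combine64(0, 745125912)   # stat64_frx
print(words_rx * 4 / frames_rx)       # -> ~907 bytes per frame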
Update Nov. 2nd 2012:
I was made aware that there is an easier and much more convenient way to use portstats64show: Just use the -long option.
pfe_ODD_B40_25:root> portstats64show 26
stat64_wtx 7 top_int : 4-byte words transmitted
485794041 bottom_int : 4-byte words transmitted
stat64_wrx 13 top_int : 4-byte words received
2521709207 bottom_int : 4-byte words received
pfe_ODD_B40_25:root> portstats64show 26 -long
stat64_wtx 30557972957 4-byte words transmitted
stat64_wrx 58371265974 4-byte words received
Much better, isn't it? Thanks to Martin Lonkwitz!
In one of my previous posts I wrote about "Why inter-node traffic across ISLs should be avoided". There is an additional "bad practice" that could lead to performance problems in the host-to-SVC traffic.
Let's imagine a core-edge fabric. A powerful switch (or director) in its center is the core. The SVC and its backend storage subsystems are directly connected to it. Besides that, there are also the ISLs to the edge switches where the hosts are connected. As there is an SVC in the fabric, all host traffic usually goes to the SVC, and the SVC is the only host the other storage subsystems see. From time to time I see a cabling like the one below. The devices are connected in a common pattern. For example, SVC ports are always on ports 0, 4, 8, ... or, on a director, on ports 0 and 16 of each card... something like that. The reason behind that is often to spread the workload over several cards/ASICs to minimize the impact of a hardware failure. But there's a risk in doing so.
Index Port Address Media Speed State Proto
0 0 190000 id 8G Online FC F-Port 50:05:07:68:01:40:a2:18
1 1 190100 id 8G Online FC F-Port 20:14:00:a0:b8:11:4f:1e
2 2 190200 id 8G Online FC F-Port 20:16:00:80:e5:17:cc:9e
3 3 190300 id 8G Online FC E-Port 10:00:00:05:1e:0f:75:be "fcsw2_102" (downstream)
4 4 190400 id 8G Online FC F-Port 50:05:07:68:01:40:06:36
5 5 190500 id 8G Online FC F-Port 20:04:00:a0:b8:0f:bf:6f
6 6 190600 id 8G Online FC F-Port 20:16:00:a0:b8:11:37:a2
7 7 190700 id 8G Online FC E-Port 10:00:00:05:1e:34:78:38 "fcsw2_92" (downstream)
8 8 190800 id 8G Online FC F-Port 50:05:07:68:01:40:05:d3
The SAN perspective
In the situation described above, all host traffic passes the ISLs from the edge switches to the core. ISLs are logically "partitioned" into so-called virtual channels. Of course the ISL is still just one fibre and only one signal passes it physically at a time. The virtual channels are just dedicated portions of the buffer credits, and the decision which virtual channel a frame takes - and therefore which portion of the buffer credits it uses - is made by looking at the destination fibre channel address.
Technical deep dive
A normal non-QoS ISL has 4 virtual channels for data traffic. On an 8G link each one of them has 5 buffers. They can only work with these 5 buffers and there is no possibility to "borrow" some from a common pool like on QoS links. With the command "portregshow" you can see the buffer credits assigned to the virtual channels (I added the first line):
VC 0 1 2 3 4 5 6 7
0xe6692400: bbc_trc 4 0 5 5 5 5 1 1
Only VCs 2-5 are used for data traffic. This makes 20 usable buffers, which should normally be enough for a multimode connection between two switches in the same room with only a few metres of cable. Basically the switch uses the last two bits of the second byte of the destination address. It looks like this:
Bits 00 -> frame uses VC 2 (which is the first virtual channel for data)
Bits 01 -> frame uses VC 3
Bits 10 -> frame uses VC 4
Bits 11 -> frame uses VC 5
So where's the problem now?
In our imaginary core-edge fabric, where for example all SVC ports are connected to ports 0 (bin 00), 4 (bin 100), 8 (bin 1000), 12 (bin 1100), ..., all host I/O towards the SVC would use the same virtual channel. As this is the only traffic that passes the ISLs from the edges to the core, only a quarter of the buffers is actually used! 5 buffers are very heavily used and 15 idle around, never to be filled. And 5 buffers are pretty few when an edge switch full of hosts wants to talk to the core switch where the SVC is connected. The result would be credit starvation and congestion on a virtual channel level.
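You can check that with a few lines of Python. This sketch assumes the standard addressing pattern where port n gets n as the area byte (port 4 on domain 0x19 becomes address 190400):

def data_vc(area_byte):
    # the last two bits of the area byte select one of the 4 data VCs (2-5)
    return 2 + (area_byte & 0b11)

for port in (0, 4, 8, 12, 16):  # the SVC cabling pattern from above
    print("port", port, "-> VC", data_vc(port))
# every single one of them ends up on VC 2!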
How to solve that?
There are 3 possibilities:
1.) You could re-cable your SAN in a manner that uses all VCs. But besides the risk of physical problems and problems introduced by maintenance actions, the devices have to learn the new addresses of the SVC ports. For many operating systems this still means reboots or reconfigurations. It could involve a lot of work and a risk of outages.
2.) You could just change the addresses with the portaddress command. This command is usually used in virtual fabric environments, and whether you can use it depends on the installed firmware and the platform. While it avoids the physical actions, it still has the same disadvantages for the hosts because of the changed addresses.
3.) The best and least disruptive possibility might be to set the ISLs to LE mode. This is the long distance mode dedicated to links under 10 km in length. It will not only put more buffers on the link (40 for user traffic on an 8G link, compared with the 20 of a normal 8G E-Port) but will also collapse the 4 user-traffic VCs into just one. It then looks like this:
VC 0 1 2 3 4 5 6 7
0xe6602400: bbc_trc 4 0 40 0 0 0 1 1
So all buffers and therefore all buffer credits will be used by the hosts and nothing idles. There will of course be a short interruption while changing the ISL to LE mode, but apart from that nothing changes for the hosts, because all the addresses stay the same. This is clearly the way to go in the situation described above.
Just something strange for the end: some switches are delivered from manufacturing with an alternative addressing pattern. For example, port 1 of domain 3 won't have the address 030100 then, but something like 030d00. In that case the problem can happen similarly but on other ports. Using LE mode would solve it in pretty much the same way.
Please keep in mind that the whole article relates to a very special (although very common) SAN layout in an SVC-centered environment. This is clearly not a standard action plan for all performance problems but it could help if you have a customer in a situation like this. For any questions, feel free to contact me.
Additionally, please be aware that this is not an SVC problem by itself but will happen with every central storage connected to a switch using a pattern as described above and being used by hosts connected to another switch over an ISL!
Update from May 9th:
I was made aware that readers of this article queried their vendors, maintenance providers or business partners with the idea to just set all their ISLs to LE mode, regardless of whether the condition described above is actually met. Because of that, I would like to state it more clearly: using LE mode as a general approach for your ISLs can cause severe problems!
If the SVC ports are not connected in a way that only one virtual channel would be used, it actually makes sense to have ISLs with more than one VC. Virtual channels are a good feature to prevent a latency bottleneck due to back pressure from impairing the traffic of all devices using the same ISL. If devices on the edge switches also communicate with devices connected to other ports of the core (or to other edges), the impact of using LE mode would be even more extreme in the case of slow drain devices.
I made some drawings to illustrate this. The first one shows a normal ISL between the edge and the core. You can see the 4 VCs used for data traffic. (I left out the other VCs for better visibility):
Here hosts 1 and 2 generate traffic against the SVC (green), host 3 against an additional disk subsystem (purple) and host 4 against a tape drive (orange). Based on the ports these devices are connected to, different VCs are used for that traffic.
If you would use an LE-port instead, it would look like this:
Now all 4 data traffic VCs have collapsed into a single one. As long as everything runs smoothly, you won't see an impact.
But if, for example, one of the devices connected to the core is slow draining, the following will most probably happen:
In the picture above the purple disk is a slow drain device. Due to back pressure the whole ISL becomes a latency bottleneck, because all data traffic shares the same VC in LE mode. The back pressure propagates further towards the edge switch, and now all 4 hosts of our example are affected, although only host 3 communicates with the slow drain device!
With a normal E-port it looks like this:
Now only VC4 is affected while VC2, 3 and 5 are running smoothly, because they have their own, unaffected buffer management. Therefore only host 3 will face a performance problem while hosts 1, 2 and 4 are running fine.
You see: using LE mode for the purpose described in my original article only makes sense if these special conditions are really met. In all other cases it can impair the SAN performance tremendously!
I didn't blog for a while because of an internal project. Like every software development project it's never really over, and development will go on over the next years to bring in new functions, but I hope I have some more time for blogging again now. :o) I also decided to move away a bit from the long blog posts I did in the past towards more conveniently readable short posts where possible.
Long distance modes
Brocade has basically 3 long distance modes:
- LE mode - merges all user-data virtual channels and assigns the amount of buffers necessary to cover a 10 km distance based on the full frame size for the given speed. It requires no license.
- LS mode - like LE mode, but is used for distances > 10 km and requires the "Extended Fabric License". You configure it with a fixed distance.
- LD mode - similar to LS mode, but the distance is measured automatically and the buffers are assigned according to the measured distance. You configure it with a "desired distance".
So what's the problem with LD?
If you have two data centers with a distance of 30 km between them and you configure 60 km, the switch will only assign the buffers for the measured 30 km. Increasing the desired distance doesn't change anything.
Wait! Why should I increase it anyway?
As written above, the number of buffers depends on the distance. The switch just calculates the number of buffers from the number of full-sized frames (frames with the maximum frame size - usually 2kB) needed to span the distance. But the problem is: in real life the average frame size is actually much smaller than the maximum one.
In the picture above you see a write I/O from a fibre channel trace. The lines with the rose background are the frames from the host, the ones with the gray background are the responses from the storage. The last column shows the size of each frame. Only the 4 data frames have the full frame size. The other 3 frames are far smaller than 2kB. So the average frame size in this example is just 1.2kB. With this average frame size you would need almost twice as many buffers to fill the link as the switch calculated! And it could be much worse. I ran a report over the full trace and the average frame size for the transmit and receive traffic was:
Given those numbers, plus a little buffer reserve, you would need 3 times the buffers the switch would use!
Okay so let's give it more buffers!
Yes, for LS mode this would be exactly the action plan. But remember: for LD mode, the switch just uses the measured distance. The desired distance is only used as an additional maximum. So if you have 30 km and configure 20 km, it will only assign the buffers for 20 km. If you configure 50 km, it will still only assign the buffers for 30 km. So my general recommendation is:
Use LS instead of LD!
LS mode gives you full control. And use it with enough buffers by configuring a multiple of the physical distance. 3x is a good practice, but you can increase it even more if there are buffers left. You can always check the available buffers with the command "portbuffershow".
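If you want a starting point for the numbers, here is a rough sketch - my own approximation, not Brocade's exact internal formula. The idea: a buffer credit only comes back after a full round trip, so you need enough credits to keep the link filled for that long.

LIGHT_US_PER_KM = 5.0  # signal propagation in fibre: ~5 microseconds per km

def buffers_needed(distance_km, line_gbaud, frame_bytes=2112):
    rtt_us = 2 * distance_km * LIGHT_US_PER_KM
    # ~10 bits per byte on the wire with 8b/10b encoding (up to 8G)
    frame_time_us = frame_bytes * 10 / (line_gbaud * 1000)
    return round(rtt_us / frame_time_us)

print(buffers_needed(10, 8.5))       # 8G runs at 8.5 Gbaud -> ~40 buffers
print(buffers_needed(30, 8.5, 700))  # 30 km with realistic frames -> ~364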
Don't leave those lazy buffers unassigned but use them to fill your links!
I claim that in 2012 performance problems will keep their place amongst the most frequent and most impactful problems in the SAN. In many of the cases the client's users really notice a performance impact and so the admin calls for support. Other support cases are opened because of performance-related messages like the ones from Brocade's bottleneckmon or Cisco's slowdrain policy for the Port Monitor. Besides that, there are also cases that don't really look like performance problems at first but turn out to have the same root causes. "I/O abort" messages in the device log, link resets, messages about frame drops, failing remote copy links, failing backup jobs or - even worse - failing recoveries: these could all be "performance problems in disguise".
When I analyze the data and find out that a slow drain device or congestion is the real reason for the problem, I write my findings down and try to give the client some hints about possible next steps. For example by mentioning my earlier blog article about How to deal with slow drain devices.
Do you know what the mean thing about it is?
Often clients have never heard of slow drain devices before. Longtime storage administrators are confronted with a term that sounds like a support guy made it up to fingerpoint at another vendor's product. Of course I usually explain what it is and what it means for the fabric and for the connected devices. But to be honest, I would be sceptical, too. I would go to the next search engine and query "slow drain device". The first hits are from this blog and from the Brocade community pages, and there are some questions about that topic. Considering the substance of posts in public forums, I would check Brocade's own SAN glossary. Guess what? Not a word about slow drain devices - which is no surprise, as it's from 2008. I would check Wikipedia. Nothing. My fellow blogger Archie Hendryx mentioned that it's missing in the SNIA dictionary, too. And he's right: nothing!
So why is that so?
Why are the terms "HTML" and "export" explained in the dictionary of the Storage Networking Industry Association, but there is not a single appearance of the term "slow drain device" on the complete SNIA website (according to their built-in search function)? Well, I don't know, but of course we can change that. The SNIA dictionary makers are asking for contributions, so if you have a term that has a meaning in the storage industry, feel free to send them a definition for the next release. I thought about doing that as well for some of the SAN performance-related terms I didn't find in the dictionary. Below you'll find some definitions that I wrote. But I'm not infallible and therefore I would like to have an open discussion about them. Let me know what you think about them. Let me know if your understanding of a term (used in the area of SAN performance, of course) differs from mine. Let me know if my wording hurts the ears of native English speakers. Let me know if you have a better definition. Let me know if there are important terms missing. And let me know if you think that a term is not really so generally used or important that it should appear in the SNIA dictionary - side by side with sophisticated terms like Tebibyte :o).
slow drain device - a device that cannot cope with the incoming traffic in a timely manner.
Slow drain devices can't free up their internal frame buffers and therefore don't allow the connected port to regain its buffer credits quickly enough.
congestion - a situation where the workload for a link exceeds its actual usable bandwidth.
Congestion happens due to overutilization or oversubscription.
buffer credit starvation - a situation where a transmitting port runs out of buffer credits and therefore isn't allowed to send frames.
The frames will be stored within the sending device, blocking its buffers, and eventually have to be dropped if they can't be sent for a certain time (usually 500 ms).
back pressure - a knock-on effect that spreads buffer credit starvation into a switched fabric starting from a slow drain device.
Because of this effect a slow drain device can affect apparently unrelated devices.
bottleneck - a link or component that is not able to transport all frames directed to or through it in a timely manner. (e.g. because of buffer credit starvation or congestion)
Bottlenecks increase the latency or even cause frame drops and upper-level error recovery.
Feel free to use the comment feature here or tweet your thoughts with hashtag #SANperfdef. If you add @Zyrober in the tweet, I'll even get a mail :o)
I updated the definitions with an additional sentence. Feel free to comment.
The term ecological footprint describes the total impact of someone or something on the environment. To achieve sustainability, this footprint should be kept as low as possible. We should not demand more from Mother Nature than she can provide, and of course we should not demand more than we actually need. Sounds simple, but the reality is way more complex. In the area of IT the term Green IT was coined to describe and consolidate all the rules, actions and requirements to decrease the ecological footprint for the sake of sustainability. And IBM has a broad agenda on this. But often we forget what each one of us could do to be a little greener.
In technical support we deal with defects. Our clients have the right to a product working within its specifications. If a part is working outside its specifications, it has to be repaired or replaced. That's it.
And what's "green" about that?
The impact on nature happens when a part is replaced that was not really broken. No manufacturing process can be so "green-optimized" that it beats simply not replacing a part in good order. There is the mining (and/or recycling) of the materials, the chemicals and energy used during processing, the packaging, the stocking and of course the logistics, too. In the end even a small part like a fan can have a huge ecological footprint. This can only be avoided by replacing only the broken part. There's just one problem with that:
What if you can't tell which part is broken?
A classic example of that is a physical error in the SAN. In my article about CRC I pointed out how to use porterrshow to find physical errors and - even more important - how to find the connection where the physical error is really located. But that's all that's possible from the data: you can only track it down to the connection. The connection usually consists of the sending SFP, the cable (plus any additional patch panels and couplers in between), and the receiving SFP. There is no reliable and technically justifiable way to tell which one is the culprit just from the porterrshow. I know that there are some "whitepapers" available on the web stating that this combination of "crc err" and "enc in" means this and that combination of "crc err" and "enc out" means that. But from a technical point of view that's nonsense.
So you have a physical problem, what to do?
When it comes to cables, my fellow IBM blogger Anthony Vandewerdt just released a great article about the impact of dust today. Other reasons for a cable to cause physical problems could be a too-small bending radius or loose couplers. In times of fully populated 48- or even 64-port cards, the front side of a SAN director often looks like the back of a hedgehog. With every maintenance action on one of the cables you can watch the CRC error counters of the surrounding ports increase. So in many situations the cable is not really broken, and replacing it wholesale just because of the counter is not eco-friendly.
The same goes for SFPs. You see physical errors increasing in the porterrshow for a specific port. That could mean that the SFP in there is broken, because its "electric eye" doesn't interpret the (good) incoming signal correctly. It could also mean that the SFP on the other end of the cable is broken, because it sends out a signal in bad condition. Both will lead to the very same counter increases in porterrshow. If you replace them both as the first action, you most probably replaced at least one good part.
Given that you have redundancy in your SAN environment (which you should ALWAYS have), you have free ports available, and the multipath drivers of the hosts using the affected path are working properly, you can track down the culprit by plugging the cable into another SFP in another port and checking whether the error stays with the port or with the cable.
Please keep in mind that the port address ("the IP address of the SAN") could change along with the port (if you don't have Cisco switches). On Brocade switches you need to do a "portswap" to swap the port addresses as well.
If you cannot touch the other ports, Brocade built some tests into FabricOS for you, like "porttest", "portloopbacktest" and "spinfab". Please have a look into the Command Line Interface Reference Guide for your FabricOS version to get more information about them. With these tests in combination with a so-called loopback plug it's easy to find out which part is really broken. Loopback plugs look like the end of a cable but simply redirect the SFP's TX signal back into its RX connector.
Mother Nature will be thankful
There is just one thing from above I want to pick up: parts working within their specification. Not every single CRC error is a reason to replace hardware. According to the Fibre Channel standard, the protocol requires a BER (Bit Error Rate) of 10^(-12) to work properly. At 8 or even 16 Gbps that means it's allowed - and fully compliant with the FC protocol - to have bit errors quite often. Here is where common sense must come into play. If you have 2-digit increases of the CRC error counter within an hour, it might be a good idea to determine which part to replace with the steps mentioned above.
If you see a single CRC error from time to time, sometimes with days of no error, sometimes with "some" per day, that's perfectly fine with the FC protocol and well within the specifications. It could lead to single temporary and recoverable errors on a host, but nothing has to be replaced as long as the rate doesn't increase significantly. You wouldn't replace your one-year-old tires just because the tread is only 90% of what it was when you bought them.
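To put "quite often" into numbers, a back-of-the-envelope sketch for a fully loaded link (8G FC runs at a line rate of 8.5 Gbaud, 16G FC at 14.025 Gbaud):

def seconds_per_bit_error(line_gbaud, ber=1e-12):
    bits_per_second = line_gbaud * 1e9
    return 1 / (bits_per_second * ber)

print(seconds_per_bit_error(8.5))     # 8G:  one bit error every ~118 s
print(seconds_per_bit_error(14.025))  # 16G: one bit error every ~71 s

So even a perfectly healthy, fully utilized link is allowed to produce a bit error every couple of minutes.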
Let's think a little bit greener - even in switch maintenance :o)
In the last couple of years, besides the buzzwords "cloud", "big data" and "VAAI", there is another topic that plays a big role in every discussion about storage products: "easy management". In most cases it means an intuitive and catchy graphical user interface that would allow even children to manage a storage array - if you believe the marketing. Along with that goes the integration of storage management tasks into the GUI of the servers themselves, and of course the automation of these tasks. If the highly skilled server and storage administrators don't have to invest their time in disproportionately laborious routine tasks anymore, they can focus on more advanced projects.
But many companies are still fighting the impact of the financial crisis. This leads to: vacant posts get dropped, teams get consolidated and cut down. The CIOs want to see the synergy effects in numbers and decreasing headcounts. Formerly specialized experts have to cope with more and more different systems. Less time, more work, less education, more stress, less productivity, more trouble - a downward spiral. Besides that, classical admin work is offshored or outtasked to operating and monitoring teams with no more than broad, general skills.
In technical support I see the effect of that "evolution" in the problem descriptions of current cases: "We see SCSI messages in the host." Or even just "We see messages. Could be the SAN." Administrators with a foundational ITIL certificate but no clue what a Read(10) is are suddenly confronted with a host running amok, with just some obscure, rough messages about its storage in the logs. To ensure a quick resolution of the problem, priority 1 would be to know what these messages actually mean. Often they are just forwarded from the device driver and there is no good documentation available explaining them properly. Or there is just something like "blabla ...then go to your service provider", not even mentioning which one - out of the broad bouquet a heterogeneous infrastructure might have - that would be. If the admin then lacks a fundamental understanding of the storage concepts and protocols, he won't be able to get any meaningful information out of that - and has to "randomly" pick a support organization for any of the involved machines.
The result: Long & critical outages.
So the colorful, dynamic, easy-to-use management interfaces protected us from the ugly technical abyss in the lower layers for the longest time. But now that there is a problem, we only get some strange sense data and don't know who could help us further. And it's the same with managing changes in the infrastructure. A lot of the problems opened at SAN support are in fact misconfigurations, user mistakes or unrealistic expectations born out of conceptual misunderstandings. "We need this 300 km synchronous mirror connection to run with 3 ms latency max. We bought your enterprise SAN gear. Why is it not fast enough?" The same with slow drain devices. If a SAN admin (who also wears the server admin's and storage admin's hats) has no idea about the traffic flow in a SAN and buffer-to-buffer credits, how could he understand the impact of a slow drain device in his environment?
That's why clouds and Storage aaS, IaaS or even SaaS are so important today. Not because of the elastic and dynamic deployment or the transparency of the costs, but because there are fewer and fewer people with deep technical background knowledge about storage and SANs available in the companies. They seemed superfluous as long as everything was running fine and an unskilled person was enough to make the few clicks in the GUI. So the only escape and the consequent next step is to move to the cloud concept.
Am I a cloud fanboy?
I wouldn't call myself a fanboy. I'm a support guy and I like to troubleshoot as effectively as possible to solve a problem as quickly as possible. And to enable me to do this, I need a skilled local counterpart who is able to collect the data and execute the action plans, who is also able to address problems to the proper support provider and to proactively monitor the environment. So if there is a classical data center with a team of skilled administrators, I'm quite happy. But if not, this "vacuum" should be filled to minimize the risk of major outages. The provider of a public cloud would have such a team.
And in private clouds?
In a well-defined and highly automated private cloud, the remaining (most probably much smaller) team of skilled admins doesn't have to care about provisioning LUNs and other standard tasks anymore. They have more time for digging deeper into the stuff. You might argue now that this just repeats the easy-management story from above. Right! But as soon as you have entered this path, and as long as the external constraints don't change, this is the only way to go. And for some of the companies out there a private cloud might just not be the best choice, and other options like outsourcing come into play.
The most important thing is to face the truth and to make an honest review of the skills available. Your data is your most precious asset and availability is crucial. If that path leads to the cloud, there is no reason to stop now. Don't wait for the next outage!
I've been blogging for a while now. Looking back, I had a personal blog about things I was interested in for some years during my studies. I did a comedic fake news page, too. My wife and I write a blog about our baby, and I also have an IBM-internal blog about SAN troubleshooting. Last year I started seb's sanblog on developerworks and it was quite a slow start. At the beginning of 2011 there was much to do for my primary job on one hand, but on the other hand my daughter was born and my interests shifted a bit. As I write the articles for this blog mainly during my spare time, the simple equation was: no spare time = no blog posts.
Midyear 2011 the situation improved a bit. My baby Johanna was somehow out of the woods (is "to be out of the woods" really the English term for finishing the most stressful phase?) after her hip dysplasia was cured, and I was able to really start blogging. And then I thought about it: what do you want to blog about? There is so much going on in the storage industry, but am I really the best person to blog about it? Can I really add some value with blog articles here? I don't think so. Of course I comment on such topics on other people's blogs, Twitter or social platforms like LinkedIn from time to time. After all, there's always some FUD around I cannot resist commenting on. But I try to keep my own blog really about SAN and storage virtualization with a focus on troubleshooting.
I wrote 19 articles in 2011. That's not much compared to, let's say, storagebod. Why is that? Well, for me it's quite a balancing act what I can blog about. Of course I can't blog about a specific customer having a problem. That's a no-go. There are also things I don't want to blog about because there is already much out there about them. And then there is stuff that I just can't blog about, because it's internal information. Special troubleshooting procedures I created, for example, or information about internal tools and projects I'm involved in.
What remains then?
Oh, there's still enough to blog about. If I notice situations like "Hey, I explained this general thing in four cases now to customers completely unaware of it", or if I see a feature that could really help admins but hardly anyone uses it so far, then I write a blog article. I see it more as additional explanation and food for thought. My target audience consists of customers on the "doing level" (admins, architects) as well as people troubleshooting SANs. I know that's a significantly smaller group than the audience of the more general storage bloggers, but I'm happy if the right people read it and I get the feedback that my blog helped them with their problems. However, I started to count the visitors internally at the end of July, and so far around 32000 have visited seb's sanblog. That's not too bad, I think.
Writing such a résumé, I want to thank the people who inspired me to start a blog. First of all there are Barry Whyte and Tony Pearson with their developerworks blogs, showing me: there are actually IBMers out there writing about my topics of interest. Reading their blogs brought me to many others - also from other companies - that I try to look into daily. Most of them you see in the list in the right bar of this blog. But a special Thank you! goes out to my Australian colleague Anthony Vandewerdt, whose blog has a big focus on the people really working with IBM storage products and therefore SAN products as well. His Aussie Storage Blog on developerworks triggered my decision to start my own external blog. Thank you again!
So what to expect from 2012?
To be honest, I have no idea :o) There is no overall plan. No weeks-long article pipeline. I'm not invited to blogger events or anything like that, and my blog is in no way a marketing channel for upcoming IBM products. Everything I write is just born out of my experience with SAN products and troubleshooting. I try not to write too much about hypes and trends, unless they have a direct impact on the SAN - like oversaturated hypervisors turning into slow drain devices, or Big Data as an excuse to do some really weird things with your storage architecture :o)
Are you still interested?
Then be my guests in 2012, and if you feel the urge to say something about, against or in addition to an article, don't hesitate to leave a comment! Have a nice start into the New Year!
Everyone is talking about cloud security these days. Is it clever to give my data outside my own data center? To another company? Maybe even outside the country? How safe and secure is that? Not only on the way there, but also once it has arrived? Are they protected enough? Are they able to block intruders both remotely and locally? And what about attackers from within the cloud service provider? The discussion is so full of - indeed reasonable - concerns that I started to wonder.
Why do I often see SANs that are not secured at all?
I don't mean the physical access control to the machines themselves. Usually companies take that one seriously. But all the other aspects of SAN security are often disregarded, in my experience. If there is no statutory duty or enforced compliance, it's just a variable in the risk calculation of security costs, probabilities and hard-to-quantify consequences in case of security breaches. And taking budget constraints and the lack of skill and manpower into consideration as well, SAN security is often treated as an orphan.
There is a huge market for IP security with firewalls, intrusion detection systems, DMZs, honeypots and hackers with hats in all colors of the rainbow. If a famous company is hacked or becomes the victim of a huge DDoS attack, you'll probably read about it in the IT news. But if a company has an internal security breach in their storage infrastructure, they'll hardly let the public know about it.
What to do from SAN point of view?
There are multiple aspects and possibilities to secure a SAN. Let's take Brocade switches as an example and let's see what could happen...
1.) Management access control
From time to time I get a request for a password reset and the switch's root account is still on the default password. THAT'S. NOT. COOL! It's really unlikely, because in all current FabricOS versions the admin is prompted to change the passwords for all four pre-configured user accounts of the switch if they are still at the defaults. But it still happens every now and then.
It's the same as for all other devices with user management in IT: choose passwords which are hard to guess, can't be found in a dictionary, contain non-alphanumeric characters and so on. Change passwords from time to time, e.g. at a 90-day interval. Most switches support RADIUS and LDAP. The ipfilter command allows you to block telnet, enforcing the use of ssh. In addition, for FabricOS v7.0x it's now officially supported to have plain key-based ssh access for more than one user, too.
And don't stick with old switches from generations ago. Not only should the lower line rate and the small feature set be considered here, but security, too. If the firmware is very old, it's also based on old components like legacy versions of openssh & Co. Very concerning security holes have been fixed over the years. You can check the installed versions of these components here. And yes, it is quite easy to see the password hashes without the root user, but at least they are salted in the current firmwares.
Security is not only about passwords, it's about user roles, too. In the Brocade switches you can define user rights with high granularity, the DCFM has its "resource groups" and the Network Advisor works with "areas of responsibility". Use them to choose wisely who can do what. You don't want to have another Terry Childs case in the media and this time about your company, do you?
The only thing I miss for many SAN switches and other storage equipment is a real, robust and trustworthy accounting or audit log. I want to see what was done on the switch and by whom. Not only what was done via CLI, but via web interfaces, management applications and shell-less CLI accesses, too. Is there no standard to have this data automatically forwarded to an internal, trusted collection server via a secured connection? Really?
2.) Traffic encryption
You should encrypt your traffic. There are several possibilities to catch the signal without your knowledge, especially if your data leaves your controlled ground on the way to a remote DR location. For FCIP traffic you should always use encryption. Indisputable. And for plain fibre-based FC long distance connections? You probably say "Hey, it's transparent and it's optical fibre, not electrical. You can't just dig a hole, strip the cladding off the cable and splice a second cable in." - You have no idea. Keep in mind that the data traversing the SAN is the really important and thus precious kind in your company. There are technical possibilities to do it, and if there is an opportunity, there could be a criminal mind using it. This perception seems to gain more and more acceptance among the switch vendors. For example, Brocade's current 16G equipment is able to run encrypted ISLs for that matter. Of course all vendors sell SAN-based encryption appliances or switches, too. This way not only the inter-location traffic is encrypted, but also the data on disk or tape. So if there would ever be the chance that some unauthorized person gets his hands on the storage, he won't be able to read the data.
3.) Fabric access control
What would be the easiest way to work around passwords and encryption if an intruder had physical access to a data center? (Just think of a student employee, a temp worker, an intern, an external engineer... I think you get the point.) He could simply spot a free port on a switch and connect a switch he brought in. Setting up a mirror port or changing the zoning to gain access to disks - and doing some other nasty things - is quite easy.
How to avoid that?
FICON environments for mainframe traffic have always had higher security demands, and we can use just the same features for open systems as well. There are security policies allowing us to control which devices are allowed to be connected to the fabric (DCC - device connection control), which switches can be part of the fabric (SCC - switch connection control) and which switches can modify the configuration (FCS - fabric configuration server). In addition, the current Brocade FabricOS versions support DH-CHAP and FCAP using certificates for authentication.
If you want to utilize the features and mechanisms described above, the FabricOS Administrator's Guide provides some good descriptions and procedures to begin with. Of course IBM offers technical consulting services to help you secure your SAN properly.
So if you are concerned whether the provisioning model your IT could be based on in the future is secure, you should be even more concerned about the security of your SAN today!
(Disclaimer: SAN switches from other vendors may have the same or similar security features, too. I just chose Brocade switches because of their prevalence within IBM's SAN customer base.)
When Brocade released FabricOS v6.0 in 2007, Quality of Service sounded like a great idea: it allows you to prioritize your traffic flow down to the level of certain device pairs. There are 3 levels of priority:
High - Medium - Low
Inter-Switch Links (ISLs) are logically partitioned into 8 so-called Virtual Channels (VCs). Basically each of them has its own buffer management, and the decision which virtual channel a frame should use is based on its destination address. If a particular end-to-end path is blocked or really slow, the impact on the communication over the other VCs is minimal. Thus only a subset of devices should be impaired during a bottleneck situation.
Quality of Service takes this one step further.
QoS-enabled ISLs consist of 16 VCs. There are slightly more buffers associated with a QoS ISL, and these buffers are equally distributed over the data VCs. (There are some "reserved" VCs for fabric communication and special purposes.) The number of VCs makes the priority work - the most VCs (and therefore the most buffers) are dedicated to the high priority, the fewest to the low one. Medium lies in the middle, obviously. So more important I/Os benefit from more resources than the less important ones.
Sounds like a great idea!
Theoretically you can configure the traffic flow in terms of buffer credit assignment in your fabric in a very fine-grained manner. But that's in fact also the big crux: you have to configure it! That means you actually have to know which host's I/O to which target device should have which priority. Technically you create QoS zones to categorize your connections. Low priority zones start with QOSL, high priority zones start with QOSH. Zones without such a prefix are considered medium priority.
But how to categorize?
That's the tricky part. The company's departments relying on IT (virtually all of them) have to bring their needs into the discussion. Maybe there are already different SLAs for different tiers of storage and an internal cost allocation in place. The I/O prioritization could go along with that, and of course it has to be taken into account to effectively meet the pre-defined SLAs. If you have to start from scratch, it's more a project for weeks and months than a simple configuration. And there is much psychology in it. Besides that, you really have to know how QoS works in detail to design a prioritization concept. For example, if you have 20 high priority zones and 50 with medium priority but only 3 low priority zones, the low ones could even perform better. In the four years since its release I have seen only a couple of customers really attempting to implement it.
In addition you need to buy the Adaptive Networking license!
So why should I care?
If QoS is such a niche feature, why blog about it? Usually a port is configured for QoS when it comes from the factory. You can see it in the output of the command "portcfgshow". A new switch will have QoS in the state "AE", which means auto-enabled - in other words "on". An 8G ISL will be logically partitioned into the 16 VCs as described above, and the buffer credits will be assigned to the high, the low and the medium priority VCs. But that does not mean that you can actually benefit from the feature, because you most probably have no QoS zones! And so all your I/O shares only the resources allocated for the medium priority. A huge part of the available buffers is reserved for VCs you cannot use! So as a matter of fact you end up with fewer buffers than without QoS, and in many cases this made the difference between a smoothly running environment and immense performance degradation.
If you don't plan to design a detailed and well-balanced concept of the priorities in your SAN environments, I recommend switching off QoS on the ports. I'm not saying QoS is bad! In fact, with the Brocade HBAs' possibility to integrate QoS even into the host connection - enabling different priorities for virtualized servers - you have the possibility to better cope with slow drain device behavior. But done wrong, QoS can have a very ugly impact on the SAN's performance!
Better know the features you use well - or they might turn against you...
As this was not clear enough in the text above and I got a question about it, please be aware: disabling QoS is disruptive for the link! In most FabricOS versions, in combination with most switch models, the link will be taken offline and online again as soon as you disable it. In some combinations you'll get the message that it will become effective with the next reset of the link. In that case you have to portdisable / portenable the port yourself.
As this is a recoverable, temporary error, your application most probably won't notice anything, but to be on the safe side you should do it in a controlled manner and - if really necessary in your environment - in times of little traffic or even in a maintenance window. The command to disable it is:
portcfgqos --disable PORTNUMBER
Performance problems are still the most malicious issues on my list. They come in many flavors, and most of them have two things in common: 1) they are hardly ever SAN defects and 2) they need to be solved as quickly as possible, because they really have an impact.
If just a switch crashed or an ISL dropped dead or even an ugly firmware bug blocks the communication of an entire fabric, it might ring all alarm bells. But that's something you (hopefully) have your redundancy for. Performance problems, on the other hand, can have a high impact on your applications across the whole data center without a concerning message in the logs, if your systems are not well prepared for it. Besides the preparation steps I pointed out here, there is a tool in Brocade's FabricOS especially for performance problems: the bottleneck monitor, or in short: bottleneckmon.
If a performance problem is escalated to technical support, the next thing that most probably happens is that the support guy asks you to clear the counters, wait up to three hours while the problem is noticeable, and then gather a supportsave from each switch in both fabrics.
Why 3 hours?
A manual performance analysis is based on certain 32-bit counters in a supportsave. In a device that's able to route I/O of several gigabits per second, 32 bits aren't a huge range for counters, and they will eventually wrap if you wait too long. But a wrapped counter is worthless, because you can't tell if and how often it wrapped. So all comparisons would be meaningless.
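As a sketch of the numbers behind the three hours: the tx-credit-zero time counter, for example, ticks in 2.5 microsecond intervals, so in the worst case it wraps after

TICK_US = 2.5  # tick interval of the credit-zero counter
wrap_seconds = 2**32 * TICK_US / 1e6
print(wrap_seconds / 3600)  # -> ~2.98 hours

which is where the 3 hour window comes from.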
Besides the wait time, the whole handling of the data collections, including gathering them and uploading them to support, takes precious time. And then support has to process and analyze them. After all these hours of continuously repeated telephone calls from management and internal and/or external customers, the support guy has hopefully found the cause of your performance problem. And keeping point 1) from my first paragraph in mind, it's most probably not even the fault of a switch*). If he makes you aware of a slow drain device, you would now start to involve the admins and/or the support for that particular device.
You definitely need a shortcut!
And this shortcut is the bottleneckmon. It's made to permanently check your SAN for performance problems. Configured correctly, it will pinpoint the cause of performance problems - at least the bigger ones. The bottleneckmon was introduced with FabricOS v6.3x with some major limitations. But with v6.4x it eventually became a must-have by offering two useful features:
Congestion bottleneck detection
This just measures the link utilization. With the Fabric Watch license (pre-loaded on many of the IBM-branded switches and directors) you have been able to do that for a long time already. But the bottleneckmon offers a bit more convenience and puts it in the proper context. The more important thing is:
Latency bottleneck detection
This feature shows you most of the medium to major situations of buffer credit starvation. If a port runs out of buffer credits, it's not allowed to send frames over the fibre. To make a long story short: if you see a latency bottleneck reported against an F-Port, you most probably found a slow drain device in your SAN. If it's reported against an ISL, there are two possible reasons:
- There could be a slow drain device "down the road" - the slow drain device could be connected to the adjacent switch or to another one connected to it. Credit starvation typically back-pressures into wide areas of the fabric.
- The ISL could have too few buffers. Maybe the link is just too long. Or the average frame size is much smaller than expected. Or QoS is configured on the link but you don't have QoS zones prioritizing your I/O - this can have a huge negative impact! Another reason could be a misconfigured long distance ISL.
Whatever it is, it is either the reason for your performance problem or at least contributing to it and should definitely be solved. Maybe this article can help you with that then.
With FabricOS v7.0 the bottleneckmon was improved again. While the core policy which detects credit starvation situations was pretty much pre-defined before v7.0, you're now able to configure it down to the minutest details. We are still testing that out in more detail - for the moment I recommend using the defaults.
So how to use it?
First of all: I highly recommend updating your switches to the latest supported v6.4x code if possible. The bottleneckmon is much better there than in v6.3! If you look up bottleneckmon in the command reference, it offers plenty of parameters and sub-commands. But in fact, for most environments and performance problems it's enough to just enable it and activate the alerting:
myswitch:admin> bottleneckmon --enable -alert
That's it. It will generate messages in your switch's error log if a congestion or a latency bottleneck is found. Pretty straightforward. If you are not sure, you can check the status with:
myswitch:admin> bottleneckmon --status
And of course there is a show command which can be used with various filter options, but the easiest way is to just wait for the messages in the error log. They will tell you the type of bottleneck and of course the affected port.
And if there are messages now?
Well, there is still the chance that there are situations of buffer credit starvation the default-configured bottleneckmon can't see. However, as you are reading an introduction here, I assume you'd just open a case with the IBM support.
You'll Never Walk Alone! :o)
*) Depending on country-specific policies and maintenance contracts, a performance analysis as described above could be a charged service in your region.
There are some goodies in FOS 7.0 that are not announced big-time. Goodies especially for us troubleshooters. There are regular, but not too frequent, so-called RAS meetings. Here we have the possibility to wish for new RAS features - wishes born out of real problem cases. Some of the wishes we had were implemented in FOS 7.0 (besides the Frame Log I already described in a previous post).
Time-out discards in porterrshow
You probably noticed that I have a hobbyhorse when it comes to troubleshooting in the SAN: performance problems. Medium to major SAN performance problems usually go along with frame drops in the fabric. If a frame is kept in a port's buffer for 500 ms because it can't be delivered in time, it will be dropped. So these drops are a good indicator of a performance problem. There is a counter in portstatsshow for each port (depending on code version and platform) named er_tx_c3_timeout, which shows how often the ASIC connected to a specific port had to drop a frame that was intended to be sent out of this port. It means: this guy was busy X times and I had to drop a frame for him.
But who looks into portstatsshow anyway? At least for monitoring? In that area the porterrshow command is way more popular, because it provides a single table for all FC ports showing the most important error counters. Unfortunately it had only one cumulative counter for all reasons of frame discards - and there are a lot more besides those time-outs. But now there are two additional counters in this table: c3-timeout tx and c3-timeout rx. Of these, the tx counter is the important one, as described above. The rx counter just gives you an idea where the dropped frames came from.
So: just focus on the TX! If it counts up, get some ideas on how to treat it here.
The firmware history
Just last week I had a fiddly case about firmware update problems again. There are restrictions on which version you can update to, based on the current one. If you don't observe the rules, things can get messed up. And they can get messed up in a way you don't see straight away. But then suddenly, after some months and maybe another firmware update, the switch runs into a critical situation. Or it has problems with exactly that new firmware update. Some of these problems can render a CP card useless, which is ugly because from a plain hardware point of view nothing is broken. But the card has to be replaced in the end. Sigh.
To make a long story short: Wouldn't it be better to actually know the versions the switch was running on in the past? And that's the duty of the firmware history:
switch:admin> firmwareshow --history
Firmware version history
Sno Date & Time Switch Name Slot PID FOS Version
1 Fri Feb 18 12:58:06 2011 CDCX16 7 1556 Fabos Version v7.0.0d
2 Wed Feb 16 07:27:38 2011 CDCX16 7 1560 Fabos Version v7.0.0a
(example borrowed from the CLI guide)
No access - No problem
There is a mistake almost everybody in the world of Brocade SAN administration makes (hopefully only) once: Trying to merge a new switch into an existing fabric and fail with a segmented ISL and a "zone conflict". Then the most probable reason is that the new switch's default zoning (defzone) is set to "no access".
This feature was introduced a while ago to make Brocade switches a little more safe. Earlier each port was able to see every other port as long as there was no effective zoning on the switch. With "no access" enabled, all traffic between each unzoned pair of devices is blocked if there is no zone including them both. The drawback of "no access" is its technical implementation, though. As soon as it was enabled a hidden zone was created and its pure existence blocked the traffic for all unzoned devices. And so without any indication the switch did end up with a zone.
But entre nous: no sane person accepts this without raising a few eyebrows. With FOS 7.0 this (mis-)behavior is gone. The new switch has a "no access" setting and wants to merge into the fabric? Fine. You don't have to care, the firmware cares for you!
Thanks for the little helpers Brocade - and I hope you stay open for new ideas :o)
Many of you (at least many of the few really reading this stuff) may already know what CRC is. But I think it doesn't hurt to have a short recap. CRC means Cyclic Redundancy Check and can be used as an error detection technique. Basically it calculates a kind of hash value that tends to be very different if you change one or more bits in the original data. Besides that, it's quite easy to implement. I once wrote a CRC algorithm in assembler (but for the Intel 8008) during my studies and it was a nice exercise in optimization.
What has that got to do with SAN?
In Fibre Channel we calculate a CRC value for each frame and store it in the 4 bytes right before the actual end of frame (EOF). The recipient reads the frame bit by bit and meanwhile calculates the CRC value by itself. Reaching the end of the frame, it knows whether the CRC value stored there matches the content of the frame. If it doesn't, there was at least one bit error, so the frame is assumed to be corrupted and can thus be dropped. Now if the recipient is a switch, the next thing to happen depends on which frame forwarding method is used:
The first method (store-and-forward): the switch reads the whole frame into one of its ingress ("incoming") buffers and checks the CRC value. If the frame is corrupted, the switch drops it. It's up to the destination device to recognize that a frame is missing, and at least the initiator will track the open exchange and start error recovery as soon as time-out values are reached. Many of the Cisco MDS 9000 switches work this way. It ensures that the network is not stressed with frames that are corrupted anyway, but it's accompanied by a higher latency. From a troubleshooting point of view, the link connected to the port reporting CRC errors is most probably the faulty one.
The second method (cut-through): to decrease this latency, the switch just reads in the destination address, and as soon as that one is confirmed to be zoned with the source connected to the F-port (a really quick look into the so-called CAM table stored within the ASIC), it sends the frame directly on its way towards the destination. So if everything works fine - enough buffer credits are available - the frame's header is already on the next link before the switch has even read the CRC value. The frame travels the whole path to the destination device even though it's corrupted, and all switches it passes will recognize that. Brocade switches work this way. As soon as the corrupted frame reaches the destination, it will be dropped.
Regardless of which method is used, the CRC remains just an error detection mechanism, and most probably the whole exchange has to be aborted and repeated anyway.
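To make the principle tangible, here's a minimal sketch in Python. It uses the stock CRC-32 from zlib as a stand-in - the real Fibre Channel CRC uses the same generator polynomial but differs in bit-ordering details - so treat it as an illustration of the check, not of the wire format:

import zlib

def append_crc(payload: bytes) -> bytes:
    # Sender side: compute the CRC over the content and store it in the
    # 4 bytes right before the end of frame.
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def crc_ok(frame: bytes) -> bool:
    # Recipient side: recompute the CRC while reading the frame and
    # compare it with the value stored at the end.
    payload, stored = frame[:-4], int.from_bytes(frame[-4:], "big")
    return zlib.crc32(payload) == stored

frame = append_crc(b"some frame content")
print(crc_ok(frame))                              # True - frame is intact
damaged = bytes([frame[0] ^ 0x01]) + frame[1:]    # flip a single bit
print(crc_ok(damaged))                            # False - corruption detected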
So how to troubleshoot CRC errors on Brocade switches then?
If you only had a counter for CRC errors, you would be in trouble now. Because if all switches along the path increase their CRC error counter for this frame, how would you know which one is really broken? If you have multiple broken links in a huge SAN, this can turn ugly. But there are two additional counters for you:
- enc in - Each frame is additionally encoded on the wire in a way that bit errors can be detected. And because the frame is decoded when it's read from the fiber and encoded again before it's sent out to the next fiber, the enc in (encoding errors inside frames) counter will only increase for the port that is connected to the faulty link.
- crc g_eof - Although a corrupted frame will be cut through as explained above, there is one thing the switch can do when it encounters a mismatch between the calculated CRC value and the one stored in the frame: it replaces the EOF with another 4 bytes meaning something like "This is the end of the frame, but the frame was recognized as corrupted." The crc g_eof counter basically means "The CRC value was wrong but nobody noticed it before - therefore it still had a good EOF." So if this counter increases for a particular link, that link is most probably the faulty one.
       frames         enc   crc   crc    too   too   bad   enc   disc  link  loss  loss  frjt  fbsy
       tx     rx      in    err   g_eof  shrt  long  eof   out   c3    fail  sync  sig
  1:   1.5g   1.8g    13    12    12     0     0     0     1.1m  0     2     650   2     0     0
  2:   1.3g   1.4g    0     101   0      0     0     0     0     0     0     0     0     0     0
  3:   1.9g   2.9g    82    15    0      0     3     12    847   0     0     0     0     0     0
Port 1 shows a link with classical bit errors. You see CRC errors and also enc in errors, and along with them crc g_eof. Everything as expected. Just go ahead and check / clean / replace the cable and/or SFPs. There are some tests you could run to determine which one is broken, like "porttest" and "spinfab".
Port 2 is a typical example of an ISL with forwarded CRC errors. This ISL itself is error-free. It just transported some previously corrupted frames (crc err but no enc in) which were already "tagged" as corrupted, hence crc g_eof does not increase.
Port 3 is a bit tricky now. If you just rely on crc g_eof, it seems to be a victim of forwarded CRC errors, too. But that's not the case. Actually, the frames were corrupted in a way that the end of the frame was not detected properly, so too long and bad eof are increased instead. Best practice: stick with the enc in counter. It still shows that the link indeed generates errors.
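The whole triage can be condensed into a few lines. A sketch of the decision logic - my rule of thumb, not official Brocade logic - fed with the three example ports from above:

def classify(enc_in: int, crc_err: int, crc_g_eof: int) -> str:
    # enc in    - encoding errors inside frames: only counts on the faulty link
    # crc err   - all CRC errors, including forwarded ones
    # crc g_eof - CRC errors on frames that still carried a good EOF
    if enc_in > 0:
        return "link generates bit errors itself - check/clean/replace cable and SFPs"
    if crc_err > 0 and crc_g_eof == 0:
        return "only forwarded, already tagged CRC errors - this link looks clean"
    if crc_g_eof > 0:
        return "first port to notice the corruption - this link is suspicious"
    return "no CRC-related errors"

print("port 1:", classify(13, 12, 12))   # classical bit errors
print("port 2:", classify(0, 101, 0))    # forwarded errors only
print("port 3:", classify(82, 15, 0))    # enc in gives it away despite crc g_eof = 0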
Hold on, help is on the way!
Now with 16G FC as state of the art, things have changed a bit. It uses a new encoding method and it comes with a forward error correction (FEC) feature. Brocade provides this with its FabricOS v7.0x on 16G links. It is able to correct up to 11 bits in a full FC frame. FEC is not really highlighted in Brocade's courses and release notes, but in my opinion this thing is a game changer! Eleven bit errors within one frame! Based on the ratio between enc in and crc err we have seen so far - which basically shows how many bit errors a corrupted frame carries on average - I assume this will solve over 90% of the physical problems we have in SANs today. Without the end-device-driven error recovery, which takes ages in Fibre Channel terms. Fewer aborts, fewer time-outs, fewer slow drain devices caused by physical problems! If this works as intended, SANs will reach a new level of reliability.
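A quick back-of-the-envelope shows why 11 correctable bits per frame is a lot. Treating each enc in event as roughly one bit error and reusing the port 1 counters from the porterrshow example above (illustration only):

enc_in, crc_err = 13, 12    # encoding errors inside frames / corrupted frames
print(f"~{enc_in / crc_err:.1f} bit errors per corrupted frame")    # ~1.1
# With FEC correcting up to 11 bits per frame, errors of this typical magnitude
# would be repaired in flight instead of triggering end-device error recovery.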
So let's see how this turns out in the future. It might be a bright one! :o)
The Storwize V7000 and the SVC (SAN Volume Controller) share the same code base and therefore the same error codes. Many of them indicate a failure condition in the machine itself, but there are others just pointing to an external problem source. The error 1370 is one of the second kind. There is not really much information about it in the manuals, but in fact it can give you a good understanding of what's going wrong.
As storage virtualization products, the SVC and the V7000 - if you use it to virtualize external storage - are actually the hosts for the external storage. In SCSI terms they are the initiators and the external backend storage arrays are the targets. Usually the initiators monitor their connectivity to the targets and do the error recovery if necessary. And so the SVC and the V7000 focus on monitoring the state of their backend storage and can actually help you troubleshoot it.
So you have 1370 errors, what now?
They come in two flavors: the event id 010018 (against an mdisk) and the event id 010030 (against a controller - aka storage array). I'll explain the 010030 as it's easier to understand, and once you understand it, you'll understand the 010018, too.
If you double-click the 1370 in your event log, you see the details of the error:
You see the reporting node and the controller the error is reported against. But the most important thing is the KCQ: the Sense Key - Code - Qualifier.
Imagine this situation: The SVC is the initiator. It sends an I/O towards the storage device - the target. But the target faces a "note-worthy" condition at that very moment. So it makes the initiator aware of it by sending a so-called "check condition". Curious as it is, the initiator wants to know the details and requests the sense data. This sense data will now be stored in - you already guessed it - a 1370, in the format Key - Code - Qualifier. The latter two are often referred to as ASC (Additional Sense Code; the green one) and ASCQ (Additional Sense Code Qualifier; the blue one).
Where's the Rosetta Stone?
This sense data can be translated using the official SCSI reference table by Technical Committee T10 (the committee behind the SCSI standards). If you encounter an ASC/ASCQ combination in a 1370 that can't be found in that list, it's most probably a vendor-specific one. In that case the manufacturer of the target device can give you more information about it.
Back to our example. So you see the ASC 29 (the "Code") and the ASCQ 00 (the "Qualifier") here. Looking that up in the list reveals: it's a "POWER ON, RESET, OR BUS DEVICE RESET OCCURRED". This so-called "POR" should make you aware that the target was recently either powered on or did a reset. Usually the initiator gets this with the first I/O it does against the target after such an event, so it is aware that any open I/O it has against this target is voided and has to be repeated.
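If you handle these KCQs regularly, a tiny lookup helper saves you the trip to the PDF. A sketch containing just the combinations from this article - extend the tables from the official T10 list as needed:

SENSE_KEYS = {
    0x02: "NOT READY",
    0x04: "HARDWARE ERROR",
    0x05: "ILLEGAL REQUEST",
    0x06: "UNIT ATTENTION",
}
ASC_ASCQ = {
    (0x29, 0x00): "POWER ON, RESET, OR BUS DEVICE RESET OCCURRED",
    (0x0C, 0x00): "WRITE ERROR",
}

def decode_kcq(key: int, asc: int, ascq: int) -> str:
    sense = SENSE_KEYS.get(key, "unknown")
    text = ASC_ASCQ.get((asc, ascq), "not in the T10 list - probably vendor specific")
    return f"Key {key:02X} ({sense}) / ASC {asc:02X} / ASCQ {ascq:02X}: {text}"

print(decode_kcq(0x06, 0x29, 0x00))    # the POR from our example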
Ah, okay. That's it?
No! You see the orange box? This is the time since this sense data was received. The unit is 10 ms, so this number actually represents a long time since the POR for this controller really happened.
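Converting that value is trivial but easy to get wrong in your head. With a hypothetical reading from the orange box:

ticks = 8_640_000              # hypothetical value - read yours from the event details
seconds = ticks * 10 / 1000    # the unit is 10 ms
print(f"{seconds / 3600:.0f} hours since this sense data was received")    # 24 hours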
So why do we have a 1370 today?
The 1370 is more of a container for sense data. The number behind the attributes shows the "slot". So the information visible here is for the first slot, and since so much time has passed since it occurred, it's meaningless for us now. Let's scroll down a bit:
In the second slot you see what's really going wrong within the external storage device at the moment, because the time value is 0. That means the 1370 was triggered because of it. And it contains a different set of sense data: ASC 0C / ASCQ 00! If you look it up in the list, you will find 0C/00, but hey - this cannot be right! The combination 0C/00 means "WRITE ERROR", but it's not defined for "Direct Access Block Devices" like storage arrays.
A Dead End?
No, of course not. In this example the storage is a DS4000. Just download the DS4000 Problem Determination Guide and it will provide an ASC/ASCQ table. There you'll see that 0C/00, together with the Sense Key 06 (the red circle), means "Caching Disabled - Data caching has been disabled due to loss of mirroring capability or low battery capacity."
Running without the cache in the backend storage can lead to severe performance degradation and should definitely be investigated! Without even looking into the backend storage you already know what's going wrong there! No need to involve SVC or V7000 support this time. Just focus on the backend storage and find out why the caching is disabled.
So please don't shoot this messenger, it just tries to help you!
Update - December 2nd 2013
The SCSI Interface Guide for IBM FlashSystem can be found here.
Time for another piece of my little series! This time I'd like to write about a new feature in v7.0x especially for administrators and support personnel: the Frame Log. Maybe it's a bit early to write about it, because it seems to be a feature "in development" at the moment, but I've waited for it so long that I'm just not able to resist. I think and I hope Brocade will develop it further, like the bottleneckmon - which I was very sceptical about in its first version when it was released in the v6.3 code. After seeing its functionality being extended in v6.4 and even more in v7.0, the bottleneckmon is an absolute must-have.
Hmm... maybe I should write an article about bottleneckmon, too :o)
Back to the Frame Log. So what's that?
Basically it is a list of frame discards. There are several reasons why a switch would have to drop a frame instead of delivering it to the destination device. One of them is a timeout. If a frame sticks in the ASIC (the "brain" behind the port) for half a second, the switch has to assume that something's going wrong and the frame cannot be delivered in time anymore. Then it drops it. Until FabOS v7.0 it just increased a counter by one. Since later v6.2x versions it was at least logged against the TX port (the direction towards the reason for the drop) - in earlier versions the counter increased only for the origin port, which made no sense at all. But now we even have a log for it! A log that stores all the frames the switch had to discard. While that sounds a bit like rummaging through the switch's trash bin, the Frame Log is very useful for troubleshooting. It contains the exact time, the TX and the RX port (keep in mind the TX is the important one) and even information from the frame itself. In the summary view you see the Fibre Channel addresses of the source device (SID) and of the destination device (DID).
For example to see the two most recent frame discards in summary mode, just type:
B48P16G:admin> framelog --show -mode summary -n 2
Fri Sep 23 16:07:13 CET 2011
Log              TX    RX
timestamp        port  port  SID       DID       SFID  DFID  Type     Count
Sep 29 16:02:08  7     5     0x040500  0x013300  1     1     timeout  1
Sep 29 16:04:51  7     1     0x030900  0x013000  1     1     timeout  1
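If you save such summary output to a file every few minutes, even a small script can tally which TX ports and SID/DID pairs keep losing frames. A sketch, assuming the column layout shown above and a hypothetical capture file:

import re
from collections import Counter

# Matches lines like: "Sep 29 16:02:08  7  5  0x040500  0x013300  1  1  timeout  1"
LINE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(\d+)\s+(\d+)\s+(0x[0-9a-fA-F]{6})\s+(0x[0-9a-fA-F]{6})"
)

drops = Counter()
with open("framelog_capture.txt") as f:    # hypothetical capture file
    for line in f:
        m = LINE.match(line.strip())
        if m:
            tx, rx, sid, did = m.groups()
            drops[(tx, sid, did)] += 1

for (tx, sid, did), count in drops.most_common():
    print(f"TX port {tx}: {count} discard(s) on the way {sid} -> {did}")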
In the so-called "dump mode" you even see the first 64 bytes of each frame. Usually I have to bring an XGIG tracer onsite to catch such information, and often it's not even possible to catch it then, because an XGIG can only trace what's going through the fibre. So you'll only see this frame if you trace a link it crosses before it is dropped. And even then you can't trigger (= stop) the tracer directly on this event; you have to have it look for a so-called ABTS (abort sequence). If a frame is dropped, the command will time out in the initiator and it will send this ABTS. Depending on what frame exactly was dropped in what direction, the ABTS could show up on the link several minutes after the actual drop of the frame. Imagine a READ command being dropped. The error recovery will start after the SCSI timeout, which could be e.g. 2 minutes. But 2 minutes is a long time in an FC trace. Chances are good that the tracer misses it then.
Not so with the Frame Log!
The Frame Log can tell you exactly which frame was dropped. If you try to find out whether a particular I/O timeout in your host was caused by a timeout discard in the fabric, this is your way to go. If you see your storage array complaining about aborts for certain sequences, just look them up in the Frame Log. With this feature Brocade finally catches up with Cisco and their internal tracing capabilities - and Brocade does it in a way more comfortable for the admin. The logging of discarded frames is enabled by default and it works on all 8G and 16G platform switches without any additional license.
The big "BUTs"
As I mentioned at the beginning of this article, there are still things for Brocade to work on to turn the Frame Log into a must-have tool like the bottleneckmon. The first catch is its volatility. In the current version it can only keep 50 frames per second on a per-ASIC basis, for 20 minutes in total. At the moment I personally think that's too short. But I'll wait for the first cases where I can use it before I form a final opinion about this limit.
The other - more concerning - constraint is that it only works for discards due to timeout at the moment. So if a frame is dropped for any of the other possible reasons, it won't be visible in the Frame Log in its current implementation. But that's exactly what I need! If the switch discards a frame because of a zone mismatch, or because the destination switch was not reachable, or because the target device was temporarily offline, or whatever - I want to see that. If a server is misconfigured (uses wrong addresses) and so cannot reach its targets, you'd see the reason right there in the log - no tracing needed! There are plenty of other situations that would be covered by such a functionality. So I honestly hope that there is a developer with a concept like this in his drawer, or who is even already implementing it. Allow me to assure you that there is at least one support guy waiting for it...
The picture is from Zsuzsanna Kilian. Thank you!
Brocade recently released its 16G platform switches and, along with them, a new major version of FabricOS: FOS 7.0. Besides the new features customers' admins, architects or end-users might be interested in, I see some nice enhancements and new tools for us support people, too. In the next blog posts I would like to present some of them and show how to use them, why they are important and where they apply.
The first one I want to write about is the D-Port or Diagnostics Port. This is a special mode every port on Brocade's 16G platform can be configured to.
Why should I use it?
Imagine a two-fabric setup, both fabrics spread over two locations, connected via some trunked ISLs through a DWDM. Every once in a while I get a case like this where there was a problem with one of these ISLs. Usually the end-users report major performance problems; there might even be crashes of hosts. The SAN admin looks into his switches, the server admins look for messages against their HBAs, and quickly they notice that the problem seems to be in one fabric only. Having a redundant second fabric available, the decision is made: "Let's block the ISLs in the affected fabric." The workaround is effective, the situation calms down, the business impact disappears. But of course there is no redundancy anymore, and the next step is to find out what happened so it can subsequently be resolved.
So a problem case is opened at the technical support. The first request from the support people will be to gather a supportsave. Often they even request to clear the counters and wait some time before gathering the data.
But it's useless now!
Of course it's most important to stop any business impact by implementing a workaround as quickly as possible, but if I get a data collection like this, it's like being asked to heal a disease on the basis of a photo of an already dead person. Usually no customer will allow re-enabling the ISLs before the cause of the problem is found and solved. Welcome to a recursive nightmare! :o)
That's where D-Ports come into play
Having Diagnostic Ports on both sides of the link allows you to test a connection between two switches without having a working ISL. This means there will be no user traffic and also no fabric management over this link, and so there will be no impact at all. From a fabric perspective, the ISL is still blocked. It comes with several automatic tests:
- Electrical loopback - (only with 16G SFP+) tests the ASIC to SFP connection locally
- Optical loopback - (with 16G SFP+ and 10G SFP+) tests the whole connection physically.
- Link traffic test - (with 16G SFP+ and 10G SFP+) does latency and cable length calculation and stress test
So this can even help you to determine the right setup for your long distance connection!
How to do it?
Although it's very easy to set this up in Network Advisor (only supported with 16G SFP+), as a support member I prefer stuff to be done via CLI, because then I can see it in the CLI history. (By the way, a real accounting or audit log covering both CLI and GUI actions would be very useful. I'm looking at you, Brocade!) First you should know which are the corresponding ports in the two switches. (The Network Advisor would do that for you.) Then you disable them on both sides using
portdisable port
Once disabled you can configure the D-Port:
portcfgdport --enable port
And finally enable it again using
portenable port
Of course you would do that on both sides. There's a separate command to view the results then:
B6510_1:admin> portdporttest --show 7
Remote WWNN: 10:00:00:05:33:69:ba:97
Remote port: 25
Start time: Thu Sep 15 02:57:07 2011
End time: Thu Sep 15 02:58:23 2011
Test                 Start time  Result  EST(secs)  Comments
Electrical loopback  02:58:05    PASSED  --         ----------
Optical loopback     02:58:11    PASSED  --         ----------
Link traffic test    02:58:18    PASSED  --         ----------
Roundtrip link latency: 924 nano-seconds
Estimated cable distance: 1 meters
If you see a test failing, you have your culprit, and based on which one fails, actions can be defined to resolve the problem. Your IBM support will of course help you with that! :o)
So if you face similar problems and you are already using 16G switches with 16G SFP+ installed, feel free to implement a workaround like blocking the ISLs to lower the impact. The D-Port will help to find out the reasons afterwards.
But if you are still on 4G or 8G hardware and you want to disable the ports most probably at fault, then please PLEASE get me a supportsave first!
Better: Clear the counters, wait 10 minutes and then gather a supportsave before you disable the ports. And even better than that: Clear counters periodically as described here.