The Storwize V7000 and the SVC (SAN Volume Controller) share the same code base and therefore the same error codes. Many of them indicate a failure condition in the machine itself, but others just point to an external problem source. The error 1370 is one of the second kind. There is not much information about it in the manuals, but in fact it can give you a good understanding of what's going wrong.
As storage virtualization products the SVC and the V7000 - if you use them to virtualize external storage - are actually the hosts for that external storage. In SCSI terms they are the initiators and the external backend storage arrays are the targets. Usually the initiators monitor their connectivity to the targets and do the error recovery if necessary. So the SVC and the V7000 keep an eye on the state of their backend storage and can actually help you troubleshoot it.
So you have 1370 errors, what now?
They come in two flavors: event ID 010018 (against an MDisk) and event ID 010030 (against a controller - aka storage array). I'll explain the 010030 as it's easier to understand, but understanding it will give you the insight to understand the 010018, too.
If you double-click the 1370 in your event log, you see the details of the error:
You see the reporting node and the controller the error is reported against. But the most important thing is the KCQ: the Sense Key - Code - Qualifier.
Imagine this situation: The SVC is the initiator. It sends an I/O towards the storage device - the target. But the target faces a "noteworthy" condition at that very moment. So it makes the initiator aware of it by sending a so-called "check condition". Curious as it is, the initiator wants to know the details and requests the sense data. This sense data is then stored in - you already guessed it - a 1370, in the format Key - Code - Qualifier. The last two are often referred to as ASC (Additional Sense Code; the green one) and ASCQ (Additional Sense Code Qualifier; the blue one).
Where's the Rosetta Stone?
This sense data can be translated using the official SCSI reference table by Technical Committee T10 (the committee that maintains the SCSI standards). If you encounter an ASC/ASCQ combination in a 1370 that can't be found in that list, it's most probably a vendor-specific one. In that case the manufacturer of the target device can give you more information about it.
Back to our example. You see the ASC 29 (the "Code") and the ASCQ 00 (the "Qualifier") here. Looking that up in the list reveals: it's a "POWER ON, RESET, OR BUS DEVICE RESET OCCURRED". This so-called "POR" tells you that the target was recently either powered on or did a reset. Usually the initiator gets this with the first I/O it sends to the target after such an event, so it knows that any outstanding I/O against this target is voided and has to be repeated.
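If you have to decode these triples regularly, a tiny lookup script saves some page-flipping. This is just a minimal sketch with a hand-picked excerpt of the T10 assignments (plus a few sense key names); the authoritative source remains the T10 list mentioned above.

```python
# Minimal sketch: translate a Sense Key / ASC / ASCQ triple from a 1370 event.
# The tables below are tiny excerpts for illustration only - the full list
# lives at T10 (and in the vendor's documentation for vendor-specific codes).

SENSE_KEYS = {
    0x02: "NOT READY",
    0x03: "MEDIUM ERROR",
    0x04: "HARDWARE ERROR",
    0x06: "UNIT ATTENTION",
}

T10_ASC_ASCQ = {
    (0x29, 0x00): "POWER ON, RESET, OR BUS DEVICE RESET OCCURRED",
    (0x0C, 0x00): "WRITE ERROR",  # not defined for direct access block devices
}

def decode_kcq(key: int, asc: int, ascq: int) -> str:
    """Return a human-readable string for a Key/Code/Qualifier triple."""
    key_text = SENSE_KEYS.get(key, f"sense key 0x{key:02X}")
    code_text = T10_ASC_ASCQ.get(
        (asc, ascq), "not in the T10 list - probably vendor specific"
    )
    return f"{key_text}: {code_text}"

# The example above: ASC 29 / ASCQ 00
print(decode_kcq(0x06, 0x29, 0x00))
```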
Ah, okay. That's it?
No! You see the orange box? This is the time since this sense data was received. The unit is 10 ms, so this number actually means that a long time has passed since there really was a POR on this controller.
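Converting that raw value is simple arithmetic; here's a small sketch (the example timer value is made up):

```python
# The "time since sense data" attribute is counted in 10 ms units,
# so turning it into something readable is plain arithmetic.
from datetime import timedelta

def slot_age(timer_value: int) -> timedelta:
    """Convert a 1370 slot timer (10 ms units) into a timedelta."""
    return timedelta(milliseconds=timer_value * 10)

# A hypothetical timer value of 8,640,000 means the sense data was
# received a full day ago - so clearly not today's problem.
print(slot_age(8_640_000))   # 1 day, 0:00:00
```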
So why do we have a 1370 today?
The 1370 is more of a container for sense data. The number behind the attributes shows the "slot". So the information visible here is for the first slot, and since so much time has passed since it occurred, it's meaningless for us now. Let's scroll down a bit:
In the second slot you see what's really going wrong within the external storage device at the moment, because the time value is 0. That means the 1370 was triggered because of it. And it contains a different set of sense data: ASC 0C / ASCQ 00! If you try to look it up in the list, you will find 0C/00, but hey - this cannot be! The combination 0C/00 means "WRITE ERROR", but it's not defined for "Direct Access Block Devices" like storage arrays.
A Dead End?
No, of course not. In this example the storage is a DS4000. Just download the DS4000 Problem Determination Guide and it will provide an ASC/ASCQ table. There you'll see that 0C/00, together with Sense Key 06 (the red circle), means "Caching Disabled - Data caching has been disabled due to loss of mirroring capability or low battery capacity."
Running without the cache in the backend storage can lead to severe performance degradation and should definitely be investigated! Without even looking into the backend storage you already know what's going wrong there! No need to involve SVC or V7000 support this time. Just focus on the backend storage and find out why the caching is disabled.
So please don't shoot this messenger, it just tries to help you!
Update - December 2nd 2013
The SCSI Interface Guide for IBM FlashSystem can be found here.
Well-made professional education is worth every cent, but in today's world controlled by CFOs, everything that costs money will be challenged sooner or later. And if you search for freebies you often end up with the first 3-4 sentences of an obviously good book about the topic and a prompt to register with your business information and email address. Weeks of business spam will follow, even if you unsubscribe again. Here are some good free books to get a solid understanding of SAN switching and how it's implemented by the two big players, Cisco and Brocade, without the need to register for anything.
Introduction to Storage Area Networks and System Networking
Working at IBM, I appreciate their Redbooks program. Experts from inside and outside IBM share their knowledge in the form of these comprehensive ebooks. This one is a good introduction to SAN and how IBM does it. You learn how Fibre Channel works, the hardware, the software, the management, the use cases and the design considerations. And of course it covers the IBM products in that area, too.
Regular readers of my blog (are there any?) may know my opinion about the SNIA Dictionary, but for learning Storage Networking it's still a good source of definitions and explanations for many of the common terms and concepts. Get it directly from snia.org.
Cisco MDS 9000 Family Switch Architecture
This document is also known as "A Day in the Life of a Fibre Channel Frame" and I like it. It has certainly seen some summers and winters since its release in 2006, but the general architecture is still the same. Of course everything is integrated and consolidated in the latest products, but if you ever understood how a frame is handled by an older-generation Cisco switch, it won't be a problem to work with, design for, or even troubleshoot the newest ones.
Brocade Fabric OS Administrator's Guide
While Brocade is certainly not revealing too much about the internals of their switches, the admin guide is still a good source of information about Brocade features and implementations. Many SAN questions I'm asked in an average week could easily be answered with a glimpse into this guide. There is a new one for each major codestream, so always look in the one for your installed FabricOS version. This is the link for FabricOS v7.2.
The remaining two ebooks on my list are specifically for performance troubleshooting... ...my hobbyhorse somehow.
Slow Drain Device Detection and Congestion Avoidance
This one is from Cisco and it covers the different types of performance problems pretty well. If you read the one about Cisco architecture before (see above), you can get much more out of this piece as well. It has some good examples, troubleshooting approaches and explanations for the counters you might see. A definite must-read.
IBM Redpaper: Fabric Resiliency Best Practices
This one is about Brocade switches and the IBM version of their "SAN Fabric Resiliency Best Practices". After explaining the fundamentals about SAN performance it shows you how performance troubleshooting is done on a Brocade fabric, especially by using built-in features like bottleneckmon.
I'm sure there are many other good learning materials out there that don't exist for the sole purpose of catching your contact details through registration. If you know some that should be on this list as well, please let me know. Thanks!
The quality of the data collection is a significant factor for quick and successful troubleshooting. Here in remote support it's essential to get the data complete, well prepared, and quickly. Quickly is clear, but what do I mean by complete and well prepared?
Collecting data for a Cisco SAN switch case is not difficult if you know what to look for. That depends on the problem. The problem could still be ongoing, or you may want to have something analyzed that happened in the past. To avoid confusion between ongoing problems and historical stuff in the data, the counters need to be cleared. It seems like common sense, but again and again I see data collections gathered the wrong way, rendering them useless for analysis.
The standard data collection for Cisco is a "showtech", to be exact a "show tech-support details". It's a script with a lot of command outputs and it has changed a lot over the hardware platforms and SAN-OS/NX-OS versions in the past. There were (and maybe will be again) bugs causing incomplete outputs, like CSCus64671 which caused incomplete data under NX-OS 6.2 and was fixed in 6.2(11c). And that was not the only one! In addition, some useful commands were never included in the script. So there's some extra work to do.
Do we look at this data directly? However much I like to dig into the guts of the data, there are things that machines can do better - for example compiling error tables for interfaces or running sanity checks against certain configurations. A colleague of mine and I are responsible for the tool that is used within IBM to analyze Cisco SAN data collections by creating a troubleshooting framework out of the data. Of course the quality of its output depends heavily on the quality of the input. The better the data, the better the tool can do its part - and we, the support engineers, can do ours.
To cover all common situations, here is what I believe to be a good data collection plan:
1) Preparing the session
The following command outputs should be gathered via CLI. Please log the (printable!) session output into one text file per switch per data collection round. On each switch, start by setting the terminal length to zero to avoid page-wise output:
Switch# terminal length 0
2) Collecting data
Switch# show tech-support all
Switch# show tech-support details
That should give us most of the expected commands. To include internal counter tables and allow the analysis of historical data, please also run:
Switch# show logging onboard
If the problem could be related to the fiber optics (SFPs) - as with all physical problems, including CRC errors, invalid transmission words, etc. - please include:
Switch# show interface transceiver details
By having them all in one text file per switch, you ensure they are processed together properly. I highly recommend using the following naming convention for the text files. It helps the IBM server choose the proper support tool, eliminating manual intervention and wait time for the support engineer.
The really important part is "_showtechSAN_" (including the underscores), but I recommend using the full pattern to allow easy identification of the proper data.
The text files of all switches can then be packed together in the same zip file and uploaded to IBM (see the end of the article).
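If you want to script that last packaging step, here is a minimal sketch. Only the "_showtechSAN_" marker is taken from the recommendation above; the rest of the file name (switch name, date) and the helper function are assumptions for illustration.

```python
# Sketch: write one session log per switch and bundle them into a single zip.
import zipfile
from datetime import date
from pathlib import Path

def pack_collections(switch_logs: dict, out_dir: str = ".") -> Path:
    """switch_logs maps a switch name to its complete CLI session text."""
    out = Path(out_dir)
    stamp = date.today().strftime("%Y%m%d")
    zip_path = out / f"cisco_san_collection_{stamp}.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for switch, session_text in switch_logs.items():
            name = f"{switch}_showtechSAN_{stamp}.txt"   # keep the marker intact
            (out / name).write_text(session_text)
            zf.write(out / name, arcname=name)
    return zip_path

# Usage (hypothetical): pack_collections({"switch01": open("switch01_session.log").read()})
```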
3) Clearing the counters
For ongoing problems it makes sense to clear the counters now. There are 3 major commands for that:
Regular interface counters:
Switch# clear counters interface all
Internal ASIC counters:
Switch# debug system internal clear-counters all
FCIP statistics:
Switch# clear ips stats all
The first two should be done in any case, the third of course only if you have FCIP.
4) Wait time
Then please wait until the problem re-occurs. We don't have a fixed wait time here, but if the problem happens only very seldom, it's advisable to clear the counters every few days to avoid catching unrelated stuff - for example high error counters caused by a maintenance action. The goal is to catch the real problem.
5) Collecting the data again
This is exactly like step 2).
6) Uploading the data
Please upload the created zip file(s) using our "Secure Upload" option here:
Just use your PMR (preferred), RCMS, or CROSS case number for your upload so the system can even notify the support engineer with an update to the case. It's also possible to upload data on the plain Machine Type / Serial Number, but then there won't be any direct correlation to the case. In the field "Upload is for:" always choose "Hardware" when you upload SAN data collections. The email address is optional, but if you provide it you will get a short notice as soon as the upload has completed successfully, and the support engineer will be able to contact you via mail if needed. After clicking on "Continue" you can drag and drop the archive file containing the data collection to upload it.
Short URL for this article: http://ibm.biz/ciscodc
Almost a year ago I wrote an article about congestion bottlenecks in Brocade switches. I said you should avoid them, because they mean that you probably have no real redundancy - either because of too much workload or because you don't use it properly. You can use the bottleneckmon to detect them. Back then I cared much more about latency bottlenecks, often caused by slow drain devices, and their implications. And I still do today.
Well...stop! Didn't you talk about congestion bottlenecks?
Yes! Today I want to explain how a congestion bottleneck can cause exactly the same symptoms on the devices as a latency bottleneck - and exactly the same performance degradation. This is how it happens. In the middle you see a SAN director with 2 port cards and 2 core cards. While the devices are connected to the port cards, the core cards provide the backend connections between them. They are internally connected via the backplane. So for example host 1's path over to storage array A would traverse its port card, then one of the two core cards, and leave through the other port card until it reaches storage array A. It can even be that two devices connected to the same port card have to go over the core cards, because so-called local switching is only done within an ASIC, and a port card can have more than one depending on the number of ports.
Now please meet host 2. Host 2 is a wonderful modern server. One of the work horses of the data center. It's fully packed with virtual machines, but its many cores and memory, as well as its state-of-the-art HBA, provide enough horsepower to cope with the workload. This baby is more than capable of doing the work and it's in no way a slow drain device. It's zoned and mapped to the storage arrays A, B, C and D and it uses them heavily, mostly for read operations. The tiny green bars are read requests, and as you see in the next picture, it sends them to all four arrays, all of the time.
Of course the other hosts send requests, too, but let's focus on our diligent host 2. Yes, the pictures are too simplistic, but I'm sure you'll get the point. On the next one you see the first responses flowing back to host 2. With host 2 communicating with several storage arrays, the link towards it is used heavily, but host 2 is processing the incoming frames quickly and gives buffer credits back to the switch in proper time. So far, so good.
But the longer the link utilization stays that high, the more likely the following will happen - if you have enabled bottleneckmon with alerting:
2013/09/07-12:07:11, [AN-1004], 7002, SLOT 7 | FID 128, WARNING, FAB1DOM5, Slot 2, port 14 is a congestion bottleneck. 99.67 percent of last 300 seconds were affected by this condition.
If you didn't enable bottleneckmon, the congestion bottleneck would still be there... you just wouldn't know it.
The crux is: you will hardly find a congestion bottleneck that just flows with high link utilization and no negative effects. The probability is much higher for the following scenario:
Although there are enough buffer credits for this highly utilized link, frames are piling up towards it, because there is just too much workload and the link is busy sending frames. There is no slow drain device, and to stay with the bathtub metaphor: the drain works very well and transports as much water as it is physically able to. But there is so much more water in the tub that it cannot all go through the drain at the same time. In addition, imagine you have not only one water tap (in our case, storage arrays) but four of them. They fill the tub quicker than the drain can empty it. As a result, the internal buffers for all the hops through the SAN director fill up (that's basically the tub) and finally the director needs to do something about it: it will slow down the sending of buffer credits to the devices. Not only to devices that want to send frames directly to host 2, but - due to back pressure - also to the ones that send frames in that rough direction (using the same internal connections, for example). And finally you'll end up with something like this:
The SAN director just behaves like a slow drain device itself!
Frames pile up inside the storage arrays and other end devices impaired by the slow drain behavior. If their RAS package is good, they will yell about credit starvation and probably even drop frames within their FC adapters. In extreme situations these frame drops can happen in the director, too. At least then you would see something that points you to a performance problem. Otherwise - if the traffic is substantially delayed but all frames finally get transferred to the next internal or external hop within the 500 ms ASIC hold time - you would only see the congestion bottleneck. And without bottleneckmon you wouldn't see anything at all. The switch would look clean. Nothing in porterrshow or portstatsshow. Both show only external port counters anyway. As a SAN administrator you would not suspect anything in the director to be the cause.
And still it would be there: a big performance problem caused by a device communicating with too many other devices. Not a slow drain device, but still causing a slow drain in the SAN.
So how to solve it?
It's basically what I wrote a year ago, plus points 3. and 4. from How to deal with slow drain devices. You just have to ensure - from an architectural design point of view - that all components of the SAN are able to cope with the workload at any given time. It's both that easy and that complex. But the first step towards resolving such a situation is to detect it properly and to keep in mind what could happen.
There are some good videos out there on the STG Europe Youtube channel about infrastructures able to cope with analytics workloads. Distinguished Engineer John Easton discusses the requirements for this kind of workload in the video "IBM Big Data with John Easton" below:
He points out that it is more efficient to use large-memory systems with high computing power like Power Systems or System z instead of multiple System x nodes working in parallel. The reason is the high I/O demand versus the high wait times that result from using disk-based storage systems to share the data between the nodes during processing. Especially for real-time analytics he recommends keeping all the computation within the same box.
The same preference for a scale-up approach with high-powered systems over scale-out infrastructures is explained by Paul Prieto, Technical Strategist for Business Analytics, in the video "Choosing the right platform for Cognos Analytics":
Can flash make a difference?
With I/O performance being the main reason for avoiding a scale-out strategy, there is of course the question: what if the I/O performance could be drastically enhanced? Before IBM acquired Texas Memory Systems in 2012, their RamSan systems were rarely used to accelerate scale-out infrastructures, as far as I know. The main use case was to boost the few big boxes running highly productive applications but waiting for their I/O due to the inadequate latencies provided by traditional disk storage systems. With I/O latencies in the range of two-digit to low three-digit microseconds and the capability to sustain several hundred thousand IOPS, they were used as Tier 0 storage for only the most demanding and business-critical workloads.
With the integration of what is now called IBM FlashSystem into the IBM storage portfolio, another use case emerged and has since played a growing role in these deployments: IBM FlashSystem behind IBM SAN Volume Controller.
The pair "FlashSystem plus SVC" represents in fact two approaches:
- Using SVC to virtualize the all-flash FlashSystem and enrich its raw I/O performance with the features you expect from today's virtualized storage solutions, like seamless migrations, remote copy, thin provisioning, snapshots (FlashCopy) and many more.
- Using FlashSystem to boost existing SVC-virtualized storage environments by using it for Easy Tier as well as for pure flash-based volumes.
Especially the second way, combined with the wide range of supported host systems, HBAs, and operating systems, now makes it interesting for a former no-go: running applications with really high I/O demand, like analytics, on scale-out commodity systems while relying on impressive I/O performance available outside in the SAN. But of course - as always - it's not that simple. Yes, there will still be scenarios where such a scale-out approach is just not applicable. Especially then it might make much sense to speed up the storage even for the scale-up, purpose-built business analytics systems. However, for many - for example SMB - companies it'd make perfect sense to run their analytics on flash-accelerated clusters of x86-based commodity hardware...
...if they do it right.
So how to do it right?
Well, this blog is not intended to explain reference architectures or architectural best practices for analytics. But I want to add the SAN point of view. (I guess you were already wondering when this would start, given the usual topics of "seb's sanblog".) And from my perspective as a SAN troubleshooter I can at least tell you what should be taken into consideration so it doesn't fail from the beginning. There are two major points: the general architecture and the hardening of the SAN. The proper architecture (for example, keeping the FlashSystem and SVC attached to the core) is the base, but a handful of issues could have an unacceptable impact on the performance. Many of them I have already covered in earlier blog posts and some of them will be the topics of future ones.
The main goal is to prevent the SVC ports from being blocked. Ever. Be it back pressure due to slow drain devices, sub-optimal cabling patterns, "unlucky" longdistance settings, enabled but unused QoS, too few buffers set for the F-ports, sheer overload of links, or many other things.
With disk-based storage we talked about good average latencies of around 3 ms. As the combination of FlashSystem plus SVC now works with a tenth of that and lower, the storage network's performance really starts to make a difference. Usually we talk about single-digit microseconds one-way from device to device in a well-designed SAN. But the issues described above could increase this into the range of hundreds of milliseconds. Then, of course, it will hardly be possible to provide real-time business analytics. Therefore it is important to harden the SAN with the possibilities you have today, like - speaking of Brocade fabrics - Fabric Watch, bottleneckmon, Advanced Performance Monitoring, port fencing, traffic isolation zones, and so on. Brocade's "Fabric Resiliency Best Practices" are a good first step in this direction.
I think it's still possible to create a scale-out infrastructure for business analytics even - and especially - with SAN-based storage, as long as it's optimally prepared and uses IBM FlashSystem solutions to overcome the mechanically caused latencies of disk storage. But it's crucial to ensure that these benefits are not rendered void by avoidable performance problems.
IBM experts are more than willing to support you in this challenge. ;-)
Well, this year passed by at high speed. How perception can change... I felt 2011 was a pretty long year. Our first child was born and the life of my wife and me was turned upside down. It was a lot of work, but good work! And I felt my body and brain adjusting to the demands. Time felt like it went by slower. Maybe I was just running on adrenaline for several months. This year was different. Many things happened at my job and time flew by. Most of them were internal stuff - important for me, maybe even interesting - but unfortunately nothing I could blog about here. Sure, I'm still deeply involved in the topic of SAN troubleshooting, but there was so much else to do in 2012.
So what to expect from 2013?
Well, I will still be here, blogging from time to time. Hopefully a little bit more than in 2012. Let's see if that really works, because we expect our second baby to be born in mid-February. So I hope for the adrenaline to kick in again. :o) There are still a lot of ideas in my mind and SAN troubleshooting is an ongoing thing. I'm here to share my experiences from my work as a SAN PFE (Product Field Engineer) in the IBM ESCC. Usually I don't get much feedback about it, but what I got was really good and I'm happy that I was able to help in a good number of situations. In 2012 the blog had about 254k hits, which I think is a good amount given my relatively small target group of SAN admins, designers and troubleshooters. Of course, I don't earn any money with ads or anything like that :o)
But often enough in 2012 I felt it wouldn't do any harm if I expanded my scope a little bit. So in 2013 I plan to stop restricting myself and to write a little bit more about other storage-related topics as well. At the moment I'm not really sure if I will do that within this blog or if I will create a new one. I'm tending towards the first option, but if you, my dear reader, have some good reasons to keep the sanblog "clean", I'll consider them, too.
Until then... Have a good start into 2013 and Happy New Year!
In the last couple of years, besides the buzzwords "cloud", "big data" and "VAAI", there is another topic that plays a big role in every discussion about storage products: "easy management". In most cases it means an intuitive and catchy graphical user interface that would allow even children to manage a storage array - if you believe marketing. Along with that goes the integration of storage management tasks into the GUI of the servers themselves and of course the automation of these tasks. If the highly skilled server and storage administrators don't have to invest their time into disproportionately laborious routine tasks anymore, they can focus on more advanced projects.
But many companies still fight the impact of the financial crisis. This leads to vacant posts being dropped and teams being consolidated and cut down. The CIOs want to see the synergy effects in numbers and decreasing headcounts. Formerly specialized experts have to cope with more and more different systems. Less time, more work, less education, more stress, less productivity, more trouble - a downward spiral. Besides that, classical admin work is offshored or outtasked to operating and monitoring teams with no more than broad, general skills.
In technical support I see the effect of that "evolution" in the problem descriptions of current cases: "We see SCSI messages in the host." Or even just "We see messages. Could be the SAN." Administrators with a foundational ITIL certificate but no clue what a Read(10) is are suddenly confronted with a host running amok, with just some obscure, rough messages about its storage in the logs. To ensure a quick resolution of the problem, priority 1 would be knowing what these messages actually mean. Often they are just forwarded from the device driver and there is no good documentation available explaining them properly. Or there is just something like "blabla ...then go to your service provider", without even mentioning which one - out of the broad bouquet one with a heterogeneous infrastructure might have - this would be. If the admin lacks a fundamental understanding of the storage concepts and protocols, he will not be able to get any sensible information out of that, and "randomly" has to pick a support organization for one of the involved machines.
The result: Long & critical outages.
So the colorful, dynamic, easy-to-use management interfaces protected us from the ugly technical abyss in the lower layers for the longest time. But now that there is a problem, we only get some strange sense data and don't know who could help us further. And it's the same with managing changes in the infrastructure. A lot of the problems opened with SAN support are in fact misconfigurations, user mistakes or unrealistic expectations born out of conceptual misunderstandings. "We need this 300km synchronous mirror connection to run with 3ms latency max. We bought your enterprise SAN gear. Why is it not fast enough?" The same with slow drain devices. If a SAN admin (who also wears the server admin's and storage admin's hats) has no idea about the traffic flow in a SAN and buffer-to-buffer credits, how could he understand the impact of a slow drain device in his environment?
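To illustrate why the 300 km expectation above is unrealistic, here's a back-of-the-envelope sketch. The only hard fact is the propagation speed of light in fibre (roughly 5 microseconds per kilometre); the assumption of two round trips per synchronous write is a typical FC write sequence (command, transfer ready, data, response) without any acceleration features.

```python
# Back-of-the-envelope: minimum latency added by distance alone for a
# synchronous mirror write. No switch, array, or protocol overhead included.

def sync_write_floor_ms(distance_km: float, round_trips: int = 2) -> float:
    """Distance-only latency floor in milliseconds for a synchronous write."""
    one_way_ms = distance_km * 5 / 1000       # ~5 us per km in fibre
    return round_trips * 2 * one_way_ms

print(sync_write_floor_ms(300))   # 6.0 ms - already above the expected 3 ms maximum
```

Even with a single round trip you land exactly at 3 ms, leaving zero budget for everything else.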
That's why clouds and Storage aaS, IaaS or even SaaS are so important today. Not because of the elastic and dynamic deployment or the transparency of the costs, but because there are fewer and fewer people with deep technical background knowledge about storage and SANs available in the companies. They seemed to be superfluous as long as everything was running fine and an unskilled person was enough to make the few clicks in the GUI. So the only escape, and the logical next step, is to move to the cloud concept.
Am I a cloud fanboy?
I wouldn't call myself a fanboy. I'm a support guy and I like to troubleshoot as effectively as possible to solve a problem as quickly as possible. And to enable me to do this, I need a skilled local counterpart who is able to collect the data and execute the action plans, who is also able to route problems to the proper support provider and to proactively monitor the environment. So if there is a classical data center with a team of skilled administrators, I'm quite happy. But if not, this "vacuum" should be filled to minimize the risk of major outages. The provider of a public cloud would have such a team.
And in private clouds?
In a well-defined and highly automated private cloud, the remaining (most probably much smaller) team of skilled admins doesn't have to take care of LUN provisioning and other standard tasks anymore. They would have more time for digging deeper into the stuff. You might argue now that this just repeats the story of the easy management above. Right! But once you have entered this path, and as long as the external constraints don't change, this is the only way to go. And for some of the companies out there a private cloud might just not be the best choice, and other options like outsourcing would come into play.
The most important thing is to face the truth and to make an honest review of the skills available. Your data is your most precious asset and availability is crucial. If that path leads to the cloud, there is no reason to stop now. Don't wait for the next outage!
Everyone is talking about cloud security these days. Is it wise to move my data outside my own data center? To another company? Maybe even outside the country? How safe and secure is that? Not only on the way there, but also once it's there? Is it protected enough? Are they able to block intruders both remotely and locally? And what about attackers from within the cloud service provider? The discussion is so full of - indeed reasonable - concerns that I started to wonder.
Why do I often see SANs that are not secured at all?
I don't mean the physical access control to the machines themselves. Usually companies take that one seriously. But all the other aspects of SAN security are often disregarded, in my experience. If there is no statutory duty or enforced compliance, it's just a variable in the risk calculation of security costs, probabilities and incalculable consequences in case of a security breach. And taking budget constraints and the lack of skill and manpower into consideration as well, SAN security is often treated as an orphan.
There is a huge market for IP security, with firewalls, intrusion detection systems, DMZs, honeypots and hackers with hats in all colors of the rainbow. If a famous company is hacked or falls victim to a huge DDoS attack, you'll probably read about it in the IT news. But if a company has an internal security breach in their storage infrastructure, they'll hardly let the public know about it.
What to do from SAN point of view?
There are multiple aspects and possibilities to secure a SAN. Let's take Brocade switches as an example and let's see what could happen...
1.) Management access control
From time to time I get a request for a password reset and the switch's root account is still on the default password. THAT'S. NOT. COOL! It should be really unlikely, because in all current FabricOS versions the admin is prompted to change the passwords for all four pre-configured user accounts of the switch if they are still at the defaults. But it still happens every now and then.
It's the same as for all other devices with user management in IT: choose passwords that are hard to guess, can't be found in a dictionary, contain non-alphanumeric characters and so on. Change passwords from time to time, for example at a 90-day interval. Most switches support RADIUS and LDAP. The ipfilter command allows you to block telnet, enforcing the use of ssh. In addition, with FabricOS v7.0x it's now officially supported to have plain key-based ssh access for more than one user, too.
And don't stick with old switches from generations ago. Not only should the lower line rate and the small feature set be considered here, but security, too. If the firmware is very old, it's also based on old components like legacy versions of openssh & co. Very concerning security holes have been fixed over the years. You can check the installed versions of these components here. And yes, it is quite easy to see the password hashes even without the root user, but at least they are salted in the current firmwares.
Security is not only about passwords, it's about user roles, too. In the Brocade switches you can define user rights with high granularity, the DCFM has its "resource groups" and the Network Advisor works with "areas of responsibility". Use them to define wisely who can do what. You don't want to see another Terry Childs case in the media, this time about your company, do you?
The only thing I miss on many SAN switches and other storage equipment is a real, robust and trustworthy accounting or audit log. I want to see what was done on the switch and by whom. Not only what was done via CLI, but via web interfaces, management applications and shell-less CLI accesses, too. Is there no standard way to have this data automatically forwarded to an internal, trusted collection server via a secured connection? Really?
2.) Encryption
You should encrypt your traffic. There are several possibilities to catch the signal without your knowledge, especially if your data leaves your controlled ground on the way to a remote DR location. For FCIP traffic you should always use encryption. Indisputable. And for plain fibre-based FC longdistance connections? You probably say "Hey, it's transparent and it's optical fibre, not electrical. You can't just dig a hole, rip the cladding off the cable and splice a second cable in." - You have no idea. Keep in mind that the data traversing the SAN is the really important and thus precious kind in your company. There are technical possibilities to do it, and if there is opportunity, there could be a criminal mind using it. This perception seems to be gaining more and more acceptance among the switch vendors. For example, Brocade's current 16G equipment is able to encrypt ISLs for that matter. Of course all vendors sell SAN-based encryption appliances or switches, too. This way not only the inter-location traffic is encrypted, but also the data on disk or tape. So if some unauthorized person ever gets their hands on the storage media, they won't be able to read the data.
3.) Fabric access control
What would be the easiest way to work around passwords and encryption if an intruder had physical access to a data center? (Think of a student employee, a temp worker, an intern, an external engineer... I think you get the point.) They could simply spot a free port on a switch and connect a switch they brought in. Setting up a mirror port or changing the zoning to gain access to disks - and doing some other nasty things - is quite easy then.
How to avoid that?
FICON environments for mainframe traffic always had higher security demands and we can use just the same features for open systems as well. There are security policies allowing us to control which devices are allowed to be connected to the fabric (DCC - device connection control), which switches can be part of the fabric (SCC - switch connection control) and which switches can modify the configuration (FCS - fabric configuration server). In addition the current Brocade FabricOS versions support DH-CHAP and FCAP using certificates for authentication.
If you want to utilize the features and mechanisms described above, the FabricOS Administrator's guide provides some good descriptions and procedures to begin with. Of course IBM offers technical consulting services to help you to secure your SAN properly.
So if you are concerned about whether the provisioning model your IT might be based on in the future is secure, you should be even more concerned about the security of your SAN today!
(Disclaimer: SAN switches from other vendors may have the same or similar security features, too. I just chose Brocade switches because of their prevalence within IBM's SAN customer base.)
Performance problems are still the most malicious issues on my list. They come in many flavors and most of them have two things in common: 1) they are hardly ever SAN defects, and 2) they need to be solved as quickly as possible, because they have a real impact.
If a switch crashes, an ISL drops dead, or even an ugly firmware bug blocks the communication of an entire fabric, it might ring all the alarm bells. But that's something you (hopefully) have your redundancy for. Performance problems, on the other hand, can have a high impact on your applications across the whole data center without a single concerning message in the logs, if your systems are not well prepared for it. Besides the preparation steps I pointed out here, there is a tool in Brocade's FabricOS especially for performance problems: the bottleneck monitor, or bottleneckmon for short.
If a performance problem is escalated to the technical support the next thing most probably happening is that the support guy asks you to clear the counters, wait up to three hours while the problem is noticeable, and then gather a supportsave of each switch in both fabrics.
Why 3 hours?
A manual performance analysis is based on certain 32-bit counters in a supportsave. In a device that routes I/O at several gigabits per second, 32 bits isn't a huge range for a counter, and it will eventually wrap if you wait too long. But a wrapped counter is worthless, because you can't tell if and how often it wrapped. So all comparisons would be meaningless.
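A rough calculation shows where the roughly-three-hour window comes from. The frame size and the assumption of a fully saturated link are mine; smaller frames or other counters wrapping would shorten the window further.

```python
# Rough estimate: time until a 32-bit frame counter wraps on a busy link.

MAX_32BIT = 2 ** 32

def hours_until_wrap(link_gbps: float, frame_bytes: int = 2048) -> float:
    """Hours until a 32-bit frame counter wraps at full link utilization."""
    bytes_per_s = link_gbps * 1e9 / 10        # 8b/10b encoding: ~100 MB/s per Gbit/s
    frames_per_s = bytes_per_s / frame_bytes
    return MAX_32BIT / frames_per_s / 3600

print(f"{hours_until_wrap(8):.1f} h")   # ~3.1 h on a saturated 8G link with full-size frames
```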
Besides the wait time, the whole handling of the data collections - including gathering them and uploading them to support - takes precious time. And then the support has to process and analyze them. After all these hours of continuously repeated telephone calls from management and internal and/or external customers, the support guy has hopefully found the cause of your performance problem. And keeping point 1) from my first paragraph in mind, it's most probably not even the fault of a switch*). If he makes you aware of a slow drain device, you would then start to involve the admins and/or support for that particular device.
You definitely need a shortcut!
And this shortcut is the bottleneckmon. It's made to permanently check your SAN for performance problems. Configured correctly, it will pinpoint the cause of performance problems - at least the bigger ones. The bottleneckmon was introduced with FabricOS v6.3x, with some major limitations. But from v6.4x on it finally became a must-have by offering two useful features:
Congestion bottleneck detection
This just measures the link utilization. With the Fabric Watch license (pre-loaded on many of the IBM-branded switches and directors) you have been able to do that for a long time already. But the bottleneckmon offers a bit more convenience and puts it in the proper context. The more important thing is:
Latency bottleneck detection
This feature shows you most of the medium to major situations of buffer credit starvation. If a port runs out of buffer credits, it's not allowed to send frames over the fibre. To make a long story short: if you see a latency bottleneck reported against an F-Port, you have most probably found a slow drain device in your SAN. If it's reported against an ISL, there are two possible reasons:
- There could be a slow drain device "down the road" - the slow drain device could be connected to the adjacent switch or to another one connected to it. Credit starvation typically causes back pressure that affects wide areas of the fabric.
- The ISL could have too few buffers. Maybe the link is just too long. Or the average frame size is much smaller than expected. Or QoS is configured on the link but you don't have QoS zones prioritizing your I/O. This could have a huge negative impact! Another reason could be a misconfigured longdistance ISL. (A rough credit estimate for long links is sketched below.)
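For the "link is just too long" case, a common rule of thumb - not an official Brocade formula - says you need roughly (speed in Gbit/s x distance in km) / 2 buffer credits for full-size frames, and proportionally more for smaller frames. Here is a small sketch of that math, with the ~5 us/km propagation delay as the only hard number:

```python
import math

def credits_needed(distance_km: float, speed_gbps: float, avg_frame_bytes: int = 2048) -> int:
    """Estimate the buffer credits needed to keep a long-distance link streaming."""
    rtt_us = distance_km * 10                                     # ~5 us per km, each way
    frame_time_us = avg_frame_bytes * 10 / (speed_gbps * 1000)    # 8b/10b encoding
    return math.ceil(rtt_us / frame_time_us) + 1

print(credits_needed(50, 8))        # ~200 credits for 50 km at 8G with 2 KB frames
print(credits_needed(50, 8, 512))   # roughly four times as many with 512-byte frames
```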
Whatever it is, it is either the reason for your performance problem or at least contributing to it and should definitely be solved. Maybe this article can help you with that then.
With FabricOS v7.0 the bottleneckmon was improved again. While the core policy which detects credit starvation situations was pretty much pre-defined before v7.0, you're now able to configure it down to the minutest details. We are still testing that out in more detail - for the moment I recommend using the defaults.
So how to use it?
First of all: I highly recommend updating your switches to the latest supported v6.4x code if possible. It's much better there than in v6.3! If you look up bottleneckmon in the command reference, it offers plenty of parameters and sub-commands. But in fact, for most environments and performance problems it's enough to just enable it and activate the alerting:
myswitch:admin> bottleneckmon --enable -alert
That's it. It will generate messages in your switch's error log if a congestion or a latency bottleneck is found. Pretty straightforward. If you are not sure, you can check the status with:
myswitch:admin> bottleneckmon --status
And of course there is a show command which can be used with various filter options, but the easiest way is to just wait for the messages in the error log. They will tell you the type of bottleneck and of course the affected port.
And if there are messages now?
Well, there is still the chance that there are situations of buffer credit starvation the default-configured bottleneckmon can't see. However, as you're reading an introduction here, I assume you'll just open a case with IBM support.
You'll Never Walk Alone! :o)
*) Depending on country-specific policies and maintenance contracts, a performance analysis as described above could be a charged service in your region.
HDS' Hu Yoshida posted an interesting theory on his blog. Basically he says that while modular dual-controller storage arrays might be useful for traditional physical server deployments, virtualized servers would need enterprise storage arrays. (Which, interestingly, are defined by "multiple processors that share a global cache" according to him.)
I wrote a small reply as a comment which still awaits moderation. Until now, Hu has usually published my few comments on his blog - regardless of how critical they were. I don't know why it didn't happen this time, but I think the most reasonable explanation is that everybody at HDS is very busy with the BlueArc acquisition. So in the meantime I'll publish it here :o)
Interesting read. IMHO there's much truth in your quote "Virtual servers can be like a drug" and I think you are also right with your observation about Tier 1 applications being virtualized. From a support perspective this could lead to bad nightmares. But to be honest, I don't get why the storage system should be the limiting factor here. The number of servers (in terms of OSes running) doesn't change in your picture and neither does the total workload towards the storage array. They were physical servers before, now they are virtual servers (VMs) on a few physical ones. In my eyes the requirements regarding the storage environment don't change much, but of course you have to check carefully whether your physical servers with their SAN connectivity could turn into a bottleneck themselves, as I pointed out in my latest blog post (http://ibm.co/mY5PnH).
Additionally, just a minor thing with the dual-controller arrays: why should the outage of the remaining controller lead to data loss? Usually the write cache of such arrays will be disabled if one controller is down, because it can't be mirrored anymore. On one hand this means decreased performance during such maintenance, but on the other hand it means that the host gets the SCSI good status only if the I/O is really written to disk. So there would be access loss, of course, but no data loss.
If you have a different - or a similar - opinion, feel free to leave a comment here :o)
There is an interesting discussion ongoing in the LinkedIn group The Storage Group. The question is "What is the REAL cost of Fibre Channel?". To my surprise the participants in this discussion relatively quickly came to the conclusion that the problem is over-provisioning, or rather under-utilization. My personal opinion was:
"I would like to come back to the over-provision / under-utilization part. Being a tech support guy, I think a bit different about that. State of the art is 16G FC now but of course I see the majority of customers being on 8G or even 4G. Eventually they will move to higher speeds. Not because all of them really need the higher speed, but it's just the switches and HBAs in sales and marketing at the moment. The "speed race" is driven mostly by the vendors and the customers who really need that line rate. But is it bad for the others? I don't think so. A 16G switch is not really 2x the price of a 8G switch or 4x the price of a 4G. In fact I see the prices sinking on a per port base with increasing functionality on the other hand. And then you stand there with your host X. It has a demand for let's say 200MB/s in total and you connected it to 2 redundant fabrics running with 8G, 1 port per fabric.
That makes: 200MB demand versus 1600MB available. WOW! YOU ARE TOTALLY UNDER-UTILIZED! Shame on you!
Well not really. Actually it's good to have redundancy. You know that. First of all "real" redundancy means you are at least 50% under-utilized per se. Plus the higher line rate that made no difference in the price compared to the lower line rate. That means it is normal that you end up over-provisioned and under-utilized.
In fact things start to get ugly if you really use all your links near 100%. I start to see that scenario more often recently when customers put VMs on ESX hosts without really knowing their I/O demand. Many of them work till the next outage (SFPs _WILL_ break some day, a software bug could crash a switch, etc) and then you see that you have no real redundancy, because you utilize your links too high.
On the other hand many of these ESX hosts with many VMs doing different unknown workload tend to turn to slow drain devices as soon as I/O peaks of certain VMs come together at the same time. Then at the latest you notice that under-utilization of a network is not really a bad thing :o)"
Especially the ESX hosts turning into slow drain devices bug me the most these days. Nobody really seems to know the demand of their VMs, and the internal statistics of the ESX seem to be very limited for that matter. If you look at the port of a slow drain device, it will most probably still look under-utilized from a bandwidth perspective, because the missing buffers plus the error recovery will keep the plain MB/s numbers down. But in fact the port is completely saturated then. And in addition, the frames eventually dropped in the SAN lead to timeouts within the slow-draining host as well. In the end it looks like: "My ESX is far from utilizing its link completely, but the SAN is bad! We have timeouts!".
So what's the demand?
Some customers have the luxury (should this really be considered a luxury?) of having a VirtualWisdom probe installed to constantly monitor the exact performance values in real time. Archie Hendryx shows some of the things you could see there in practice in his whitepaper "Destroying the Myths surrounding Fibre Channel SAN". But if you don't have such gear and you don't know the demand, it might be worth having an additional ESX host for testing. It doesn't have to be the biggest machine, don't worry. Every day you would take another candidate out of your bulk of VMs with unknown I/O bandwidth (or CPU / memory / etc.) demand and put it on that test server with vMotion. Being relatively unimpaired by the other VMs (at least within the ESX), you can then measure all the performance values for 24 hours and - provided no error recovery or external congestion takes place - these are the real demands of that VM. And only based on these demands do you really know which VMs are allowed to come together on the same bare metal. Only then will you have a chance to actually improve the under-utilization in a controlled manner without slamming your SAN into the realms of chaos. The approach seems very simple and straightforward to me, but I see nobody doing this. So what's my error in reasoning, dear reader?
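The placement decision itself is then trivial arithmetic. A toy sketch - all names and numbers are made up, and the 50% headroom mirrors the redundancy argument from the quote above:

```python
# Toy sketch: check whether a set of VMs fits on a host's SAN links,
# based on per-VM peak demand measured on the test host.

PER_HOST_LINK_MBPS = 800 * 2   # e.g. 2 x 8G ports, ~800 MB/s usable each
HEADROOM = 0.5                 # keep 50% free so one fabric can carry everything

measured_peaks = {"vm-db01": 310, "vm-web01": 45, "vm-mail01": 120, "vm-bi01": 390}

def fits_on_host(vm_names):
    """True if the combined measured peaks stay within the headroom budget."""
    total = sum(measured_peaks[name] for name in vm_names)
    return total <= PER_HOST_LINK_MBPS * HEADROOM

print(fits_on_host(["vm-db01", "vm-web01", "vm-mail01"]))   # True  (475 <= 800)
print(fits_on_host(["vm-db01", "vm-bi01", "vm-mail01"]))    # False (820 > 800)
```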
(Thanks to Harout S Hedeshian for the picture.)
Recently I attended a presentation about IBM's cloud computing approaches by IBM Fellow Stefan Pappe. Cloud computing is a big topic in IT nowadays - no doubt about that - but how much impact does it have on SAN troubleshooting? Will the way hardware support is performed change in the cloud? Based on your understanding of the term cloud you might either say yes or no. In a cloud, IT is just a commodity like water or electrical power. You just use it. You most likely don't want to know how it works as long as its availability is guaranteed. If a component of a server breaks, the whole construct relies on redundancy. Either within the server (multiple paths etc.) or within a pool of servers where the VMs residing on this particular piece of metal are concurrently moved to other servers. This frees up the broken one for maintenance later on.
For a SAN it's quite similar - we rely on internal redundancy (multiple power supplies, failover-capable control processors and backlink modules) as well as external redundancy (a second independent fabric, multiple paths, multiple ISLs), with an important exception: some SAN-related problems have to be troubleshot "on the open heart". Please don't get me wrong. I don't mean that finding a good workaround isn't important - it surely is, and in most scenarios it's a key element for business continuity. But if the symptoms can't be seen anymore, it might be hard for the support member to do the problem determination.
So what now?
Most of these "workarounded" problems can still be troubleshooted if the SAN is well prepared. Especially part 2 of my How to be prepared blog post can help you with that topic. In addition Please gather a data collection from each and every component in the SAN that is related to the problem before you implement any workaround! For the SAN switches that means, if you have performance problems for example, please gather a data collection of all SAN switches.
For other problems it might be necessary to actually test the repaired component / modified configuration / improvement in the code in the productive environment to know if it really helped. Of course all the possibles tests that can be done "offline" should be done first. For example before bringing a formely toggling ISL back to life, it's better to use the built-in port test capabilities of the switches with loopback-plugs.
And as another exception compared with server redundancy: A SAN troubleshooting should not be postponed to gather "workarounded" problems for a certain time and solve them later all at once.
- In most cases redundancy in the SAN means you have two of a kind. Not five or eight or hundreds. So if the core of fabric A fails, it has to be repaired as soon as possible, because a failure of the core in fabric B would then lead to a full outage.
- Different concurrent SAN problems can overlap and create much bigger problems, or at least ambiguous symptoms that are much harder to troubleshoot. "Double errors" or "triple errors" are among the worst things to troubleshoot.
- SAN environments are complex structures with lots of hardware and software. There are many things that could prevent the redundancy from being utilized properly, such as bugs in multipath drivers, wrong configurations, or underestimating the workload on the redundant paths and components during a problem situation.
So if it can be done now, do it now!
Besides that, the cloud brings special requirements such as multi-tenancy on the SAN components. Cisco have had their VSANs for a long time now, but when it comes to IVR (Inter-VSAN Routing) I sometimes see very strange configurations out there, based on a wrong understanding of the concept. Brocade's first attempt in that direction was the "Administrative Domains", which came with some very concerning flaws in my opinion. With the v6.2x code stream this concept was virtually replaced by the "Virtual Fabrics" concept. With "base switches", "XISLs" & co., many new possibilities for misconfiguration appeared. A lot of new stuff to learn for customers, admins, architects and of course support members.
To sum up, I can say that if SAN troubleshooting was done properly before, there won't be much change here. But the cloud boosts the users' expectations of their SAN even more: it should just work! No downtime of the application, ever! Our primary goal is to deal with upcoming problems in a way that prevents any impact on the applications.
Because in the future, zero downtime will no longer be a high-end enterprise feature but a commodity.