First of all: the following blog post is about some SAN extension considerations related to Brocade SAN switches. The described problems may affect other vendors as well but will not be discussed here. It also does not cover all subtopics and considerations but describes one specific problem.
There are a lot of different SAN extensions out there in the field and Brocade supports a considerable proportion of them. You can see them in the Brocade Compatibility Matrix in the "Network Solutions" section. As offsite replication is one of the key items of a good DR solution, I see many environments spread over multiple locations. If the data centers are close enough to avoid slower WAN connections, multiplexer solutions like CWDM, TDM or DWDM are usually used to carry several connections over one long-distance link.
From a SAN perspective these multiplexers are transparent or non-transparent. Transparent in this context means that:
- They don't appear as a device or switch in the fabric.
- Everything that enters the multiplexer on one site will come out of the (de-)multiplexer on the other site in exactly the same way.
While the first point is true for most of the solutions, the second point is the crux. With "everything" I mean all the traffic. Not only the frames, but also the ordered sets. So it should be really the same. Bit by bit by bit exactly the same. If the multiplexing solution can only guarantee the transfer of the frames it is non-transparent.
So how could that be a problem?
In most cases the long distance connection is an ISL (Inter Switch Link). An ISL does not only transport "user frames" (SCSI over FC frames from actual I/O between an initiator and a target) but also a lot of control primitives (the ordered sets) and administrative communication to maintain the fabric and distribute configuration changes. In addition there are techniques like Virtual Channels or QoS (Quality of Service) to minimize the influence of different I/O types, and techniques to keep the link in good condition, like fillwords for synchronization or Credit Recovery. All these techniques rely on a transparent connection between the switches. If you don't have a transparent multiplexer, you have to ensure that these techniques are disabled and of course you can't benefit from their advantages. Problems start when you try to use them but your multiplexer doesn't meet the prerequisites.
What can happen?
Credit Recovery - which allows the switches to exchange information about the used buffer-to-buffer credits and offers the possibility to react to credit loss - cannot work if IDLEs are used as fillwords. The switches would use several different fillwords (ARB-based ones) to communicate their states. If the multiplexer strips all the fillwords and just inserts IDLEs at the other site (some TDMs do that), or if the link is configured to use IDLEs, it will start toggling, with a most likely disastrous impact on the I/O in the whole fabric.
Another problem is less obvious. I mentioned Virtual Channels (VC) before. The ISL is logically split. Of course not the fibre itself - the frames still pass it one by one. But the buffer management establishes several VCs, each with its own buffer-to-buffer credits. There are VCs solely used for administrative communication, like VC0 for Class_F (Fabric Class) traffic. Then there are several VCs dedicated to "user traffic". Which VC is used by a certain frame is determined by the destination address in its header; a modulo operation calculates the correct VC. The advantage is that a slow-draining device does not completely block an ISL because no credits are sent back to allow the switch to send the next frame over to the other side. If you have VCs, the credits are returned as "VC_RDY"s. If your multiplexer doesn't support that (along with ARB fillwords) because it's not transparent, you can't have VCs and "R_RDY"s will be used to return credits. The result: as you have only one big channel, Class_F and "user frames" (Class_3 & Class_2) will share the same credits and the switches will prioritize Class_F. If you have much traffic anyway, many fabric state changes or even a slow-draining device, things will start to become ugly: both types of traffic will interfere, buffer credits drop to zero, traffic gets stalled, frames will be delayed and then dropped (after the 500ms ASIC hold time). Error recovery will generate more traffic and will have an impact on the applications, visible as timeouts. Multipath drivers will fail over paths, bringing more traffic onto other ISLs that most probably pass through the same multiplexer. => Huge performance degradation, lost paths, access losses, big trouble.
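As a rough illustration of the modulo mapping described above: the actual VC count and assignment are ASIC- and generation-specific, so the VC numbers below are assumptions, not Brocade's real mapping.

```python
# Toy model of virtual-channel assignment on an ISL. The real mapping is
# ASIC-specific; the VC list and the byte used here are assumptions.
DATA_VCS = [2, 3, 4, 5]   # hypothetical VCs reserved for Class_3 "user traffic"

def vc_for_frame(did):
    """Pick a data VC from the 24-bit destination ID via a modulo operation."""
    area = (did >> 8) & 0xFF          # the "area" byte encodes the port
    return DATA_VCS[area % len(DATA_VCS)]

# Frames to different destination ports spread across VCs, so a single
# slow-draining destination only starves a subset of the credits.
print(vc_for_frame(0x013300))  # area 0x33
print(vc_for_frame(0x013000))  # area 0x30
```

Because each VC keeps its own credits, frames hashed to other VCs can still flow even when one VC's credits are stuck at zero.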
You see, using the wrong (or at least "non-optimal") equipment can lead to severe problems. It's even more provoking when the multiplexer in use is in fact transparent but the wrong settings are configured on the switches. So if you see such problems or similar issues and you use a multiplexer on the affected paths, check whether your multiplexer is transparent (with the matrix linked above) and whether you use the correct configuration (refer to the FabOS Admin Guide). And if you have a non-transparent multiplexer and no possibility to get a transparent one, don't hesitate to contact your IBM sales rep and ask about consultation on how to deal with situations like this (e.g. with traffic shaping / tuning, etc).
Two additional topics for my previous post came to mind and I doubt that they will be the last ones :o)
Have a proper SAN management infrastructure
For most of you it's self-evident to have a proper SAN management infrastructure, but from time to time I see environments where this is not the case. In some it's explained with security policies ("Wait - you are not allowed to have your switches in a LAN? And the USB port of your PC is sealed? You have no internet access? No, I don't think that you should send a fax with the supportshow..."), sometimes it's just economizing at the wrong end. And sometimes there is simply no overall plan for SAN management. So I think at least the following things should be in place to enable timely support:
- A management LAN with enough free ports to allow integration of support-related devices. For example a Fibre Channel tracer.
- A host in the management LAN which is accessible from your desk (e.g. via VNC or MS RDP) and has access to the management interfaces of all SAN devices. This host should at least boot from an internal disk rather than out of the SAN.
- A good SSH and Telnet tool should be installed which allows you to log the printable output of a session into a text file. I personally like PuTTY.
- A TFTP and an FTP server on the host mentioned above. They can be used for supportsaves, config backups, firmware updates etc. They should always be running and, where possible, the devices should be pre-configured to use them (e.g. with supportftp on Brocade switches).
- If your security policy allows it, it's helpful to have Wireshark installed, which can be used for "fcanalyzer" traces on Cisco switches or to trace the Ethernet if you have management connection problems with your SAN products.
- The internet connection needs enough upload bandwidth. Fibre Channel traces can be several gigabytes in size. When time matters, undersized internet connections are a [insert politically correct synonym for PITA here :o) ]
- Callhome and remote support connection where applicable. Callhome can save you a lot of time in problem situations. No need to call support and open a case manually. The support will call you. And most of the SAN devices will submit enough information about the error to give the support member at least an idea where to start and which steps to take first. So in some situations callhomes trigger troubleshooting before your users even notice a problem. In addition some machines (like DS8000) allow the support to dial into it and gather the support data directly - and only the support data. Don't worry - your user data is safe!
- Have all passwords at hand. This includes the root passwords as some troubleshooting actions can only be done with a root user.
- Have all cables and at least one loopback plug at hand. With cables I mean at least: one serial cable, one null-modem cable, one ethernet patch cable and one ethernet crossover cable (not all devices have "auto-negotiating" GigE interfaces)... better more. And of course a good stock of FC cables should be onsite as well.
- The NTP servers as mentioned in my previous blog post.
Monitoring, counter resets and automatic DC
Besides any SAN monitoring you hopefully do already (Cisco Fabric Manager / Brocade DCFM / Network Advisor / Fabric Watch / SNMP traps / syslog server / etc) there is one thing in addition: automatic data collections based on cleared counters. Finding physical problems on links, frame corruption on SAN director backlinks, slow-drain devices or toggling ports - for all these problems it helps a lot if you can 1. do problem determination based on counters cleared on a regular basis and 2. look back in time to see exactly when it started and maybe how the problem "evolved" over time.
What you need is some scripting skills and a host in the management LAN (with an FTP server) to run scripts from, as mentioned above. A good practice is to look for a good time slot - better not during workload peaks - and set up a timed script (e.g. a cron job) that does the following:
- Gather data collections of all switches - use "supportsave" for Brocade switches and for Cisco switches log the output of a "show tech-support details" into a text file.
- Reset the counters - use both "slotstatsclear" and "statsclear" for Brocade switches and for Cisco switches run both "clear counters interface all" and "debug system internal clear-counters all". The debug command is a hidden one, so please type in the whole one as auto-completion won't work. The supportsave is already compressed but for the Cisco data collection it might be a good idea to compress it with the tool of your choice afterwards.
Additional hint: Use proper names for the Cisco Data collections. They should at least contain the switchname, the date and the time!
Depending on the disk space and the number of the switches, it may be good to delete old data collections after a while. For example you could keep one full week of data collections and for older ones only keep one per week as a reference.
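The naming and retention rules above can be sketched in a few lines of Python. The filename pattern and the one-week retention window are just the example values from this post, not any product's convention:

```python
from datetime import datetime, timedelta

def collection_name(switchname, when):
    """Build a data collection filename containing switch, date and time."""
    return f"{switchname}_{when:%Y%m%d_%H%M%S}_showtech.txt.gz"

def keep(files, now, full_days=7):
    """Keep everything younger than full_days; of older files keep one per week.

    files is a list of (name, timestamp) tuples; returns the names to keep,
    newest first.
    """
    kept, seen_weeks = [], set()
    for name, ts in sorted(files, key=lambda f: f[1], reverse=True):
        if now - ts <= timedelta(days=full_days):
            kept.append(name)                    # full retention window
        else:
            week = ts.isocalendar()[:2]          # (ISO year, ISO week)
            if week not in seen_weeks:
                seen_weeks.add(week)
                kept.append(name)                # weekly reference copy
    return kept
```

A cron job would call something like this after each collection run and delete everything not returned by keep().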
If you have a good idea in addition how to be best prepared for the next problem case, please let me know. :o)
Recently I attended a presentation about IBM's cloud computing approaches by IBM Fellow Stefan Pappe. Cloud computing is a big topic in IT nowadays - no doubt about that - but how much impact does it have on SAN troubleshooting? Will the way hardware support is performed change in the cloud? Based on your understanding of the term cloud you might either say yes or no. In a cloud, IT is just a commodity like water or electrical power. You just use it. You most likely don't want to know how it works as long as its availability is guaranteed. If a component of a server breaks, the whole construct relies on redundancy. Either within the server (multiple paths etc.) or within a pool of servers where the VMs residing on this particular piece of metal are concurrently moved to other servers. This frees up the broken one for maintenance later on.
For a SAN it's quite similar - we rely on internal redundancy (multiple power supplies, failover-capable control processors and backlink modules) as well as external redundancy (second independent fabric, multiple paths, multiple ISLs), with an important exception: some SAN-related problems have to be troubleshooted "on the open heart". Please don't get me wrong. I don't mean that finding a good workaround isn't important - it surely is, and in most scenarios it's a key element for business continuity. But if the symptoms can't be seen, it might be hard for the support member to do the problem determination.
So what now?
Most of these "workarounded" problems can still be troubleshooted if the SAN is well prepared. Especially part 2 of my "How to be prepared" blog post can help you with that topic. In addition, please gather a data collection from each and every component in the SAN that is related to the problem before you implement any workaround! For the SAN switches that means: if you have performance problems, for example, please gather a data collection of all SAN switches.
For other problems it might be necessary to actually test the repaired component / modified configuration / improvement in the code in the productive environment to know whether it really helped. Of course all possible tests that can be done "offline" should be done first. For example, before bringing a formerly toggling ISL back to life, it's better to use the built-in port test capabilities of the switches with loopback plugs.
And here is another exception compared with server redundancy: SAN troubleshooting should not be postponed in order to collect "workarounded" problems for a certain time and solve them all at once later.
- In most cases redundancy in the SAN means you have two of a kind. Not five or eight or hundreds. So if the core of fabric A fails, it has to be repaired as soon as possible, because a failure of the core in fabric B would then lead to a full outage.
- Different concurrent SAN problems can overlay and create much bigger problems or at least ambiguous symptoms that are much harder to troubleshoot. "Double errors" or "triple errors" are among the worst things to troubleshoot.
- SAN environments are complex structures with lots of hardware and software. There are many things that can prevent redundancy from being utilized properly, such as bugs in multipath drivers, wrong configurations or underestimation of the workload on the redundant paths and components during a problem situation.
So if it can be done now, do it now!
Besides that there are special requirements of the cloud such as the ability for multi-tenancy on the SAN components. Cisco has had its VSANs for a long time now, but when it comes to IVR (Inter-VSAN Routing) I sometimes see very strange configurations out there based on a wrong understanding of the concept. Brocade's first attempt in that direction were the "Administrative Domains", which came with some very concerning flaws in my opinion. With the v6.2x code stream this concept was virtually replaced by the "Virtual Fabrics" concept. With "base switches", "XISLs" & co, many new possibilities for misconfiguration appeared. Much new stuff to learn for customers, admins, architects and of course support members.
To sum up, I can say that if SAN troubleshooting was done properly before, there won't be much change here. But the cloud boosts the expectations of the users regarding their SAN even more: It should just work! No downtime of the application, ever! Our primary goal is to deal with upcoming problems in a way that prevents any impact on the applications.
Because in the future, zero downtime will no longer be a high-end enterprise feature but a commodity.
Brocade recently released its 16G platform switches and along with them a new major version of FabricOS: FOS 7.0. Beside the new features customers' admins, architects or end-users might be interested in, I see some nice enhancements and new tools for us support people, too. In the next blog posts I would like to present some of them and show how to use them, why they are important and where they apply.
The first one I want to write about is the D-Port or Diagnostics Port. This is a special mode every port on Brocade's 16G platform can be configured to.
Why should I use it?
Imagine a two-fabric setup, both fabrics spread over two locations, connected via some trunked ISLs through a DWDM. Every once in a while I get a case like this where there was a problem with one of these ISLs. Usually the end-users report major performance problems, there might even be crashes of hosts. The SAN admin looks into his switches, the server admins look for messages against their HBAs, and quickly they notice that the problem seems to be in one fabric only. Having a redundant second fabric available, the decision is made: "Let's block the ISLs in the affected fabric." The workaround is effective, the situation calms down, the business impact disappears. But of course there is no redundancy anymore, and the next step is to find out what happened so it can subsequently be resolved.
So a problem case is opened at the technical support. The first request from the support people will be to gather a supportsave. Often they even request to clear the counters and wait some time before gathering the data.
But it's useless now!
Of course it's most important to stop any business impact by implementing a workaround as quickly as possible, but if I get a data collection like this, it's like being asked to heal a disease on the basis of a photo of an already dead person. Usually no customer will allow re-enabling the ISLs before the cause of the problem is found and solved. Welcome to a recursive nightmare! :o)
That's where D-Ports come into play
Having Diagnostics ports on both sides of the link will allow you to test a connection between two switches without having a working ISL. This means there will be no user traffic and also no fabric management over this link and so there will be no impact at all. From a fabric perspective, the ISL is still blocked. It comes with several automatic tests:
- Electrical loopback - (only with 16G SFP+) tests the ASIC to SFP connection locally
- Optical loopback - (with 16G SFP+ and 10G SFP+) tests the whole connection physically.
- Link traffic test - (with 16G SFP+ and 10G SFP+) does latency and cable length calculation and stress test
So this can even help you to determine the right setup for your long distance connection!
How to do it?
Although it's very easy to set this up in Network Advisor (only supported with 16G SFP+), as a support member I prefer things to be done via CLI, because then I can see them in the CLI history. (By the way, a real accounting or audit log covering both CLI and GUI actions would be very useful. I'm looking at you, Brocade!) First you should know which are the corresponding ports on the two switches. (Network Advisor would figure that out for you.) Then you disable them on both sides using:
portdisable <port>
Once disabled you can configure the D-Port:
portcfgdport --enable <port>
And finally enable it again using:
portenable <port>
Of course you would do that on both sides. There's a separate command to view the results:
B6510_1:admin> portdporttest --show 7
Remote WWNN: 10:00:00:05:33:69:ba:97
Remote port: 25
Start time: Thu Sep 15 02:57:07 2011
End time: Thu Sep 15 02:58:23 2011
Test Start time Result EST(secs) Comments
Electrical loopback 02:58:05 PASSED -- ----------
Optical loopback 02:58:11 PASSED -- ----------
Link traffic test 02:58:18 PASSED -- ----------
Roundtrip link latency: 924 nano-seconds
Estimated cable distance: 1 meters
If you see the test failing, you have your culprit and based on which one is failing actions can be defined to resolve the problem. Your IBM support will of course help you with that! :o)
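As a rough plausibility check for the latency and distance figures such a test reports: light travels at roughly 5 ns per meter in glass fiber, and the measured roundtrip latency also contains fixed SFP/ASIC overhead, which the switch subtracts internally. The overhead value used below is purely an assumed number for illustration:

```python
# Back-of-the-envelope model of a D-Port-style distance estimate.
NS_PER_METER = 5.0          # ~speed of light / refractive index 1.5, one way

def estimate_distance_m(roundtrip_ns, fixed_overhead_ns):
    """Subtract port overhead, halve for one way, convert ns to meters."""
    one_way_ns = (roundtrip_ns - fixed_overhead_ns) / 2
    return max(0.0, one_way_ns / NS_PER_METER)

# A 10 km DWDM span adds ~100 microseconds of roundtrip fiber latency:
print(estimate_distance_m(100_000 + 900, 900))  # -> 10000.0
```

On a patch-cable link the fixed overhead dominates the roundtrip time, which is why the example output above shows 924 ns of latency for an estimated distance of only 1 meter.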
So if you face similar problems and you are already using 16G switches with 16G SFP+ installed, feel free to implement a workaround like blocking the ISLs to lower the impact. The D-Port will help to find out the reasons afterwards.
But if you are still on 4G or 8G hardware and you want to disable the most probable guilty ports, then please PLEASE get me a supportsave first!
Better: Clear the counters, wait 10 minutes and then gather a supportsave before you disable the ports. And even better than that: Clear counters periodically as described here.
Time for another piece of my little series! This time I'd like to write about a new feature in v7.0x, especially for administrators and support personnel: the Frame Log. Maybe it's a bit early to write about it, because it seems to be a feature "in development" at the moment, but I have waited for it so long that I'm just not able to resist. I think and hope Brocade will develop it further, like the bottleneckmon - which I was very sceptical about when its first version was released in the v6.3 code. After seeing its functionality being extended in v6.4 and even more in v7.0, the bottleneckmon is an absolute must-have.
Hmm... maybe I should write an article about bottleneckmon, too :o)
Back to the Frame Log. So what's that?
Basically it is a list of frame discards. There are several reasons why a switch would have to drop a frame instead of delivering it to the destination device. One of them is a timeout. If a frame sticks in the ASIC (the "brain" behind the port) for half a second, the switch has to assume that something is going wrong and that the frame cannot be delivered in time anymore. Then it drops it. Until FabOS v7.0 the switch just increased a counter by one. Since later v6.2x versions the drop was at least logged against the TX port (the direction towards the reason for the drop) - in earlier versions the counter increased only for the origin port, which made no sense at all. But now we even have a log for it! A log storing all the frames the switch had to discard. While that sounds a bit like rummaging through the switch's trash bin, the Frame Log is very useful for troubleshooting. It contains the exact time, the TX and the RX port (keep in mind the TX is the important one) and even information from the frame itself. In the summary view you see the Fibre Channel addresses of the source device (SID) and of the destination device (DID).
For example to see the two most recent frame discards in summary mode, just type:
B48P16G:admin> framelog --show -mode summary -n 2
Fri Sep 23 16:07:13 CET 2011
Log TX RX
timestamp port port SID DID SFID DFID Type Count
Sep 29 16:02:08 7 5 0x040500 0x013300 1 1 timeout 1
Sep 29 16:04:51 7 1 0x030900 0x013000 1 1 timeout 1
In the so-called "dump mode" you even see the first 64 bytes of each frame. Usually I have to bring an XGIG tracer onsite to catch such information, and often it's not even possible to catch it then, because an XGIG can only trace what's going through the fibre. So you'll only see this frame if you trace a link it crosses before it is dropped. And even then you can't trigger (=stop) the tracer directly on this event; you have to have it look for a so-called ABTS (abort sequence). If a frame is dropped, the command will time out in the initiator and it will send this ABTS. Depending on what frame exactly was dropped in which direction, the ABTS could appear on the link several minutes after the actual drop of the frame. Imagine a READ command being dropped. The error recovery will start after the SCSI timeout, which could be e.g. 2 minutes. But 2 minutes is a long time in an FC trace. Chances are good that the tracer misses it then.
Not so with the Frame Log!
The frame log can tell you exactly which frame was dropped. If you try to find out if a particular I/O timeout in your host was caused by a timeout discard in the fabric, this is your way to go. If you see your storage array complaining about aborts for certain sequences, just look them up in the Frame Log. With this feature Brocade finally catches up with Cisco and their internal tracing capabilities - and Brocade does it way more comfortable for the admin. The logging of discarded frames is enabled by default and it works on all 8G and 16G platform switches without any additional license.
The big "BUTs"
As I mentioned at the beginning of this article, there are still things for Brocade to work on to turn the Frame Log into a must-have tool like the bottleneckmon. The first catch is its volatility. In the current version it can only keep 50 frames per second, per ASIC, for 20 minutes in total. At the moment I personally think that's too short. But I'll wait for the first cases where I can use it before I form an ultimate opinion about this limit.
The other - more concerning - constraint is that at the moment it only works for discards due to timeouts. So if a frame is dropped because of one of the many other possible reasons, it won't be visible in the Frame Log in its current implementation. But that's exactly what I need! If the switch discards a frame because of a zone mismatch, or because the destination switch was not reachable, or because the target device was temporarily offline, or whatever - I want to see that. If a server is misconfigured (uses wrong addresses) and so cannot reach its targets, you'd see the reason right there in the target switch's Frame Log - no tracing needed! There are plenty of other situations that would be covered by such functionality. So I honestly hope that there is a developer with a concept like this in his drawer or even already in implementation. Allow me to assure you that there is at least one support guy waiting for it...
The picture is from Zsuzsanna Kilian. Thank you!
Many of you (at least many of the few really reading this stuff) may already know what CRC is. But I think it doesn't hurt to have a short recap. CRC means Cyclic Redundancy Check and can be used as an error detection technique. Basically it calculates a kind of hash value that tends to be very different if you change one or more bits in the original data. Besides that, it's quite easy to implement. I once wrote a CRC algorithm in assembler (for the Intel 8008) during my studies and it was a nice exercise in optimization.
What has that got to do with SAN?
In Fibre Channel we calculate a CRC value for each frame and store it as the 4 bytes right before the end of frame (EOF). The recipient reads the frame bit by bit and meanwhile calculates the CRC value itself. Reaching the end of the frame it knows whether the CRC value stored there matches the content of the frame. If not, it knows that there was at least one bit error, the frame has to be assumed corrupted and thus can be dropped. Now if the recipient is a switch, what happens next depends on which frame forwarding method is used:
The switch reads the whole frame into one of its ingress ("incoming") buffers and checks the CRC value ("store-and-forward"). If the frame is corrupted, the switch drops it. It's up to the destination device to recognize that a frame is missing; at least the initiator will track the open exchange and start error recovery as soon as time-out values are reached. Many of the Cisco MDS 9000 switches work this way. It ensures that the network is not stressed with frames that are corrupted anyway, but it comes with a higher latency. From a troubleshooting point of view, the link connected to the port reporting CRC errors is most probably the faulty one.
To decrease this latency, the switch can instead just read in the destination address, and as soon as that one is confirmed to be zoned with the source connected to the F-port (a really quick look into the so-called CAM table stored within the ASIC), the frame goes directly on its way towards the destination ("cut-through"). So if everything works fine - enough buffer credits are available - the frame's header is already on the next link before the switch has even read the CRC value. The frame will travel the whole path to the destination device even though it's corrupted, and all switches it passes will recognize that this frame is corrupted. Brocade switches work this way. As soon as the corrupted frame reaches the destination, it will be dropped.
Regardless which method is used, the CRC value remains just an error detection and most probably the whole exchange has to be aborted and repeated anyway.
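Since Fibre Channel uses CRC-32 with the same generator polynomial as Ethernet, Python's zlib.crc32 is enough to illustrate how a flipped bit is detected (the payload below is made up):

```python
import zlib

# A made-up frame payload; the CRC appended before the EOF would be
# calculated over it with CRC-32, which zlib.crc32 implements.
payload = bytearray(b"SCSI READ, LUN 2, LBA 0x0001f3a0")
crc_sent = zlib.crc32(payload)            # stored in the frame by the sender

payload[5] ^= 0x04                        # a single bit flips on the wire
crc_received = zlib.crc32(payload)        # recalculated by the recipient

print(crc_sent != crc_received)           # True: the corruption is detected
```

CRC-32 is guaranteed to catch any single-bit error, which is exactly the property the frame check relies on.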
So how to troubleshoot CRC errors on Brocade switches then?
If you had only a counter for CRC errors, you would be in trouble now. Because if all switches along the path increase their CRC error counter for this frame, how would you know which link is really broken? If you have multiple broken links in a huge SAN, this could turn ugly. But there are two additional counters for you:
- enc in - Each frame is additionally encoded in a way that allows bit errors to be detected. And because the frame is decoded when it's read from the fibre and encoded again before it's sent out to the next fibre, the enc in (encoding errors inside frames) counter will only increase on the port that is connected to the faulty link.
- crc g_eof - Although a corrupted frame will be cut-through as explained above, there is just one thing the switch can do in addition when it encounters a mismatch between the calculated CRC value and the one stored in the frame: it will replace the EOF with another 4 bytes meaning something like "This is the end of the frame, but the frame was recognized as corrupted." The crc g_eof counter basically means "The CRC value was wrong but nobody noticed it before. Therefore it still had a good EOF." So if this counter increases for a particular link, it is most probably faulty.
        frames          enc   crc   crc    too   too   bad   enc   disc  link  loss  loss  frjt  fbsy
        tx      rx      in    err   g_eof  shrt  long  eof   out   c3    fail  sync  sig
   1:   1.5g    1.8g    13    12    12     0     0     0     1.1m  0     2     650   2     0     0
   2:   1.3g    1.4g    0     101   0      0     0     0     0     0     0     0     0     0     0
   3:   1.9g    2.9g    82    15    0      0     3     12    847   0     0     0     0     0     0
Port 1 shows a link with classical bit errors. You see CRC errors and also enc in errors. Along with them you see crc g_eof. Everything as expected. Just go ahead and check / clean / replace the cable and/or SFPs. There are some tests you could do to determine which one is broken, like "porttest" and "spinfab".
Port 2 is a typical example of an ISL with forwarded CRC errors. This ISL itself is error-free. It just transported some previously corrupted frames (crc err but no enc in) which were already "tagged" as corrupted, hence no crc g_eof increases.
Port 3 is a bit tricky. If you just rely on crc g_eof it seems to be a victim of forwarded CRC errors, too. But that's not the case. Actually the frames were broken in a manner that the end of the frame was not detected properly, so too long and bad eof are increased instead. Best practice: stick with the enc in counter. It still shows that the link indeed generates errors.
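The decision logic of these three examples can be condensed into a small sketch (the counter names are taken straight from the porterrshow table above; the classification strings are mine):

```python
def diagnose(enc_in, crc_err, crc_g_eof):
    """Classify a port's CRC-related counters, as in the three examples above."""
    if enc_in > 0:
        # The local link itself generates bit errors - regardless of whether
        # the frames still carried a good EOF (port 1) or were mangled so
        # badly that too long / bad eof fire instead (port 3).
        return "faulty local link"
    if crc_err > 0 and crc_g_eof == 0:
        # Frames arrived already tagged as corrupted: forwarded errors only.
        return "forwarded CRC errors, link itself is clean"
    return "no CRC-related problem"

print(diagnose(13, 12, 12))   # port 1 -> faulty local link
print(diagnose(0, 101, 0))    # port 2 -> forwarded CRC errors
print(diagnose(82, 15, 0))    # port 3 -> faulty local link
```

The sketch encodes the best practice from above: enc in decides, crc g_eof only distinguishes local from forwarded corruption.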
Hold on, Help is on the way!
Now with 16G FC as the state of the art, things have changed a bit. It uses a new encoding method and it comes with a forward error correction (FEC) feature. Brocade provides this with its FabricOS v7.0x on 16G links. It is able to correct up to 11 bits in a full FC frame. FEC is not really highlighted or specially called out in their courses and release notes, but in my opinion this thing is a game changer! Eleven bit errors within one frame! Based on the ratio between enc in and crc err we have seen so far - which basically shows how many bit errors you have in a frame on average - I assume this will simply solve over 90% of the physical problems we have in SANs today. Without the end-device-driven error recovery, which takes ages in Fibre Channel terms. Fewer aborts, fewer time-outs, fewer slow-drain devices caused by physical problems! If this works as intended, SANs will reach a new level of reliability.
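The FEC on real 16G links uses a much stronger hardware code, but the principle - redundant bits letting the receiver repair errors without any retransmission - can be illustrated with a classic Hamming(7,4) toy example. Everything below is illustrative and not the actual FC code:

```python
def encode(d):
    """Encode 4 data bits into a 7-bit codeword [p1, p2, d0, p3, d1, d2, d3]."""
    d0, d1, d2, d3 = d
    p1 = d0 ^ d1 ^ d3      # parity over codeword positions 1, 3, 5, 7
    p2 = d0 ^ d2 ^ d3      # parity over codeword positions 2, 3, 6, 7
    p3 = d1 ^ d2 ^ d3      # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d0, p3, d1, d2, d3]

def decode(c):
    """Correct up to one flipped bit and return the 4 data bits."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 1-indexed position of the error, 0 if none
    c = list(c)
    if syndrome:
        c[syndrome - 1] ^= 1          # repair the single bit error in place
    return [c[2], c[4], c[5], c[6]]

codeword = encode([1, 0, 1, 1])
codeword[2] ^= 1                      # a bit flips on the wire
print(decode(codeword))               # -> [1, 0, 1, 1], recovered without ABTS
```

The receiver fixes the error on its own; no ABTS, no SCSI timeout, no error recovery traffic - which is exactly why FEC removes so many of the symptoms described earlier.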
So let's see how this turns out in the future. It might be a bright one! :o)
There are some goodies in FOS 7.0 that are not announced big-time. Goodies especially for us troubleshooters. There are regular, but not too frequent, so-called RAS meetings, where we have the possibility to wish for new RAS features - wishes born out of real problem cases. Some of the wishes we had were implemented in FOS 7.0 (besides the Frame Log I already described in a previous post).
Time-out discards in porterrshow
You probably noticed that I have a hobbyhorse when it comes to troubleshooting in the SAN: performance problems. Medium to major SAN-performance problems usually go along with frame drops in the fabric. If a frame is kept in a port's buffer for 500ms, because it can't be delivered in time, it will be dropped. So these drops would be a good indicator for a performance problem. There is a counter in portstatsshow for each port (depending on code version and platform) named er_tx_c3_timeout, which shows how often the ASIC connected to a specific port had to drop a frame that was intended to be sent to this port. It means: This guy was busy X times and I had to drop a frame for him.
But who looks at portstatsshow anyway? At least for monitoring? In that area the porterrshow command is way more popular, because it provides a single table for all FC ports showing the most important error counters. Unfortunately it used to have only one cumulative counter for all reasons of frame discards - and there are a lot more besides those time-outs. But now there are two additional counters in this table: c3-timeout tx and c3-timeout rx. Of the two, the tx counter is the important one, as described above. The rx counter just gives you an idea where the dropped frames came from.
So: just focus on the TX! If it counts up, get some ideas how to treat it here.
The firmware history
Just last week I had another fiddly case about firmware update problems. There are restrictions on which version you can update to based on the current one. If you don't observe the rules, things can get messed up. And they can get messed up in a way you don't see straightaway. But then suddenly, after some months and maybe another firmware update, the switch runs into a critical situation. Or it has problems with exactly that new firmware update. Some of these problems can render a CP card useless, which is ugly because from a plain hardware point of view nothing is broken. But the card has to be replaced in the end. Sigh.
To make a long story short: Wouldn't it be better to actually know the versions the switch was running on in the past? And that's the duty of the firmware history:
switch:admin> firmwareshow --history
Firmware version history
Sno Date & Time Switch Name Slot PID FOS Version
1 Fri Feb 18 12:58:06 2011 CDCX16 7 1556 Fabos Version v7.0.0d
2 Wed Feb 16 07:27:38 2011 CDCX16 7 1560 Fabos Version v7.0.0a
(example borrowed from the CLI guide)
No access - No problem
There is a mistake almost everybody in the world of Brocade SAN administration makes (hopefully only) once: trying to merge a new switch into an existing fabric and failing with a segmented ISL and a "zone conflict". The most probable reason is that the new switch's default zoning (defzone) is set to "no access".
This feature was introduced a while ago to make Brocade switches a little safer. Earlier, each port was able to see every other port as long as there was no effective zoning on the switch. With "no access" enabled, all traffic between each unzoned pair of devices is blocked if there is no zone including them both. The drawback of "no access" is its technical implementation, though. As soon as it was enabled, a hidden zone was created, and its mere existence blocked the traffic for all unzoned devices. And so, without any indication, the switch ended up with a zone.
But entre nous: no sane person accepts this without raising a few eyebrows. With FOS 7.0 this (mis-)behavior is gone. The new switch has a "no access" setting and wants to merge the fabric? Fine. You don't have to care, the firmware cares for you!
Thanks for the little helpers Brocade - and I hope you stay open for new ideas :o)
Performance problems are still the most malicious issues on my list. They come in many flavors and most of them have two things in common: 1) They are rarely caused by actual SAN defects and 2) They need to be solved as quickly as possible, because they really have an impact.
If a switch just crashed or an ISL dropped dead or even an ugly firmware bug blocked the communication of an entire fabric, it might ring all alarm bells. But that's something you (hopefully) have your redundancy for. Performance problems, on the other hand, can have a high impact on your applications across the whole data center without a single concerning message in the logs, if your systems are not well prepared for it. Besides the preparation steps I pointed out here, there is a tool in Brocade's FabricOS especially for performance problems: the bottleneck monitor, or short: the bottleneckmon.
If a performance problem is escalated to technical support, the next thing that most probably happens is that the support guy asks you to clear the counters, wait up to three hours while the problem is noticeable, and then gather a supportsave from each switch in both fabrics.
Why 3 hours?
A manual performance analysis is based on certain 32-bit counters in a supportsave. In a device that's able to route I/O of several gigabits per second, 32 bits aren't a huge range for counters, and they will eventually wrap if you wait too long. But a wrapped counter is worthless, because you can't tell if and how often it wrapped. So all comparisons would be meaningless.
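The three-hour window can be reproduced with a quick back-of-the-envelope calculation. All numbers here are assumptions for the sketch: a fully loaded 8G link, 8b/10b encoding, and roughly 2148 bytes on the wire per full-size frame:

```python
# How long until a 32-bit frame counter wraps on a fully loaded 8G link.
# All figures are rough assumptions for illustration, not exact ASIC behavior.
line_rate_bytes = 8e9 / 10            # 8b/10b: 10 line bits per data byte
frame_bytes = 2148                    # header plus ~2112-byte payload
frames_per_second = line_rate_bytes / frame_bytes
hours_to_wrap = 2**32 / frames_per_second / 3600
print(f"~{hours_to_wrap:.1f} hours")
```

So on a busy link the counters stay meaningful for roughly three hours - hence the window.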
Besides the wait time, the whole handling of the data collections, including gathering and uploading them to the support, takes precious time. And then the support has to process and analyze them. After all these hours of continuously repeated telephone calls you get from management and internal and/or external customers, the support guy has hopefully found the cause of your performance problem. And keeping point 1) from my first paragraph in mind, it's most probably not even the fault of a switch*). If he makes you aware of a slow drain device, you would now start to involve the admins and/or support for the particular device.
You definitely need a shortcut!
And this shortcut is the bottleneckmon. It's made to permanently check your SAN for performance problems. Configured correctly, it will pinpoint the cause of performance problems - at least the bigger ones. The bottleneckmon was introduced with FabricOS v6.3x, with some major limitations. But from v6.4x on it eventually became a must-have by offering two useful features:
Congestion bottleneck detection
This just measures the link utilization. With the Fabric Watch license (pre-loaded on many of the IBM-branded switches and directors) you have been able to do that for a long time already. But the bottleneckmon offers a bit more convenience and puts it in the proper context. The more important thing is:
Latency bottleneck detection
This feature shows you most of the medium to major situations of buffer credit starvation. If a port runs out of buffer credits, it's not allowed to send frames over the fibre. To make a long story short: if you see a latency bottleneck reported against an F-Port, you most probably found a slow drain device in your SAN. If it's reported against an ISL, there are two possible reasons:
- There could be a slow drain device "down the road" - the slow drain device could be connected to the adjacent switch or to another one connected to it. Credit starvation typically propagates backwards and affects wide areas of the fabric.
- The ISL could have too few buffers. Maybe the link is just too long. Or the average frame size is much smaller than expected. Or QoS is configured on the link but you don't have QoS zones prioritizing your I/O - this can have a huge negative impact! Another reason could be a misconfigured long distance ISL.
Whatever it is, it is either the reason for your performance problem or at least contributing to it and should definitely be solved. Maybe this article can help you with that then.
With FabricOS v7.0 the bottleneckmon was improved again. While the core policy which detects credit starvation situations was pretty much pre-defined before v7.0, you're now able to configure it in the minutest detail. We are still testing that out - for the moment I recommend using the defaults.
So how to use it?
First of all: I highly recommend updating your switches to the latest supported v6.4x code if possible. It's much better there than in v6.3! If you look up bottleneckmon in the command reference, it offers plenty of parameters and sub-commands. But in fact, for most environments and performance problems it's enough to just enable it and activate the alerting:
myswitch:admin> bottleneckmon --enable -alert
That's it. It will generate messages in your switch's error log if a congestion or a latency bottleneck was found. Pretty straightforward. If you are not sure you can check the status with:
myswitch:admin> bottleneckmon --status
And of course there is a show command which can be used with various filter options, but the easiest way is to just wait for the messages in the error log. They will tell you the type of bottleneck and of course the affected port.
And if there are messages now?
Well, there is still the chance that there are actually situations of buffer credit starvation the default-configured bottleneckmon can't see. However, since you are reading an introduction here, I assume you'll just open a case with IBM support.
You'll Never Walk Alone! :o)
*)Depending on country-specific policies and maintenance contracts a performance analysis as described above could be a charged service in your region.
When Brocade released FabricOS v6.0 in 2007 Quality of Service sounded like a great idea: It allows you to prioritize your traffic flow to the level of certain device pairs. There are 3 levels of priority:
High - Medium - Low
Inter Switch Links (ISLs) are logically partitioned into eight so-called Virtual Channels (VCs). Basically, each of them has its own buffer management, and the decision which virtual channel a frame should use is based on its destination address. If a particular end-to-end path is blocked or really slow, the impact on the communication over the other VCs is minimal. Thus only a subset of devices should be impaired during a bottleneck situation.
Quality of Service takes this one step further.
QoS-enabled ISLs consist of 16 VCs. There are slightly more buffers associated with a QoS ISL, and these buffers are equally distributed over the data VCs. (There are some "reserved" VCs for fabric communication and special purposes.) The number of VCs is what makes the prioritization work: the most VCs (and therefore the most buffers) are dedicated to the high priority, the fewest to the low one, and medium lies in between. So more important I/Os benefit from more resources than the less important ones.
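As a toy illustration of that split - the 5/4/2 distribution of data VCs and the size of the buffer pool are assumptions for the sketch, not exact FabricOS internals:

```python
# Hypothetical split of an ISL's buffer pool across priority VCs.
# VC counts and pool size are assumed numbers, purely for illustration.
data_vcs = {"high": 5, "medium": 4, "low": 2}   # assumed data VC counts
buffer_pool = 220                               # assumed credits for data VCs
per_vc = buffer_pool // sum(data_vcs.values())
for prio, n in data_vcs.items():
    print(prio, n * per_vc)   # credits available to each priority level
```

The point to take away: each priority level only ever sees its own share of the pool.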
Sounds like a great idea!
Theoretically you can configure the traffic flow in terms of buffer credit assignment in your fabric very fine-grained. But that's in fact also the big crux: You have to configure it! That means you actually have to know which host's I/O to which target device should be which priority. Technically you create QoS-zones to categorize your connections. Low priority zones start with QOSL, high priority zones start with QOSH. Zones without such a prefix are considered as medium priority.
But how to categorize?
That's the tricky part. The company's departments relying on IT (virtually all) have to bring their needs into the discussion. Maybe there are already different SLAs for different tiers of storage and an internal cost allocation in place. The I/O prioritization could go along with that, and of course it has to be taken into account to effectively meet the pre-defined SLAs. If you have to start from scratch, it's more a project of weeks and months than a simple configuration. And there is a lot of psychology in it. Besides that, you really have to know in detail how QoS works to design a prioritization concept. For example, if you have 20 high priority zones and 50 with medium priority but only 3 low priority zones, the low ones could even perform better. In the four years since its release I have seen only a couple of customers really attempting to implement it.
In addition you need to buy the Adaptive Networking license!
So why should I care?
If QoS is such a niche feature, why blog about it? Usually a port is configured for QoS when it comes from the factory. You can see it in the output of the command "portcfgshow". A new switch will have QoS in the state "AE", which means auto-enabled - in other words "on". An 8G ISL will be logically partitioned into the 16 VCs as described above, and the buffer credits will be assigned to the high, medium and low priority VCs. But that does not mean that you can actually benefit from the feature, because you most probably have no QoS zones! And so all your I/O shares only the resources allocated for the medium priority. A huge part of the available buffers is reserved for VCs you cannot use! So as a matter of fact you end up with fewer buffers than without QoS, and in many cases this made the difference between a smoothly running environment and immense performance degradation.
If you don't plan to design a detailed and well-balanced concept for the priorities in your SAN environments, I recommend switching off QoS on the ports. I'm not saying QoS is bad! In fact, with the Brocade HBAs' possibility to integrate QoS even into the host connection - enabling different priorities for virtualized servers - you have the possibility to better cope with slow drain device behavior. But done wrong, QoS can have a very ugly impact on the SAN's performance!
Better know the features you use well - or they might turn against you...
As this was not clear enough in the text above and I got a question about it, please be aware: Disabling QoS is disruptive for the link! In most FabricOS versions in combination with most switch models, the link will be taken offline and online again as soon as you disable it. In some combinations you'll get the message that it will take effect with the next reset of the link. In that case you have to portdisable / portenable the port yourself.
As this is a recoverable, temporary error, your application most probably won't notice anything, but to be on the safe side you should do it in a controlled manner and - if really necessary in your environment - in times of little traffic or even a maintenance window. The command to disable it is:
portcfgqos --disable PORTNUMBER
Everyone is talking about cloud security these days. Is it clever to move my data outside my own data center? To another company? Maybe even outside the country? How safe and secure is that? Not only on the way there, but also once it has arrived? Is it protected well enough? Are they able to block intruders both remotely and locally? And what about attackers from within the cloud service provider? The discussion is so full of - indeed reasonable - concerns that I started to wonder.
Why do I often see SANs that are not secured at all?
I don't mean the physical access control to the machines themselves. Usually companies take that one seriously. But all the other aspects of SAN security are often disregarded, according to my experience. If there is no statutory duty or enforcement of compliance, it's just a variable in the risk calculation of security costs, probabilities and hard-to-quantify consequences in case of security breaches. And taking budget constraints and lack of skill and manpower into consideration as well, SAN security is often treated like an orphan.
There is a huge market for IP security with firewalls, intrusion detection systems, DMZs, honeypots and hackers with hats in all colors of the rainbow. If a famous company is hacked or victim of a huge DDOS attack you probably read that in the IT news. But if a company has an internal security breach in their storage infrastructure they'll hardly let the public know about it.
What to do from SAN point of view?
There are multiple aspects and possibilities to secure a SAN. Let's take Brocade switches as an example and let's see what could happen...
1.) Management access control
From time to time I get a request for a password reset and the switch's root account is still on the default password. THAT'S. NOT. COOL! It should be really unlikely, because in all current FabricOS versions the admin is prompted to change the passwords for all four pre-configured user accounts of the switch if they are still at the defaults. But it still happens every now and then.
It's the same as for all other devices with user management in IT: Choose passwords which are hard to guess, can't be found in a dictionary, contain non-alphanumeric characters and so on. Change passwords from time to time, say in a 90-day interval. Most switches support RADIUS and LDAP. The ipfilter command allows you to block telnet, enforcing the use of ssh. In addition, with FabricOS v7.0x it's now officially supported to have plain key-based ssh access for more than one user, too.
And don't stick with old switches from generations ago. Not only the lower line rate and the small feature set should be considered here, but security, too. If the firmware is very old, it's also based on old components like legacy versions of openssh & Co. Serious security holes have been fixed over the years. You can check the installed versions of these components here. And yes, it is quite easy to see the password hashes without the root user, but at least they are salted in the current firmware versions.
Security is not only about passwords, it's about user roles, too. In the Brocade switches you can define user rights with high granularity, the DCFM has its "resource groups" and the Network Advisor works with "areas of responsibility". Use them to choose wisely who can do what. You don't want to have another Terry Childs case in the media and this time about your company, do you?
The only thing I miss for many SAN switches and other storage equipment is a real, robust and trustworthy accounting or audit log. I want to see what was done on the switch and by whom. Not only what they did via CLI, but via webinterfaces, management applications and shell-less CLI accesses, too. Is there no standard to have these data automatically forwarded to an internal, trusted collection server via a secured connection? Really?
2.) Traffic encryption
You should encrypt your traffic. There are several ways to catch the signal without your knowledge, especially if your data leaves your controlled ground on the way to a remote DR location. For FCIP traffic you should always use encryption. Indisputable. And for plain fibre-based FC long distance connections? You probably say "Hey, it's transparent and it's optical fibre, not electrical. You can't just dig a hole, rip the cladding off the cable and splice a second cable in." - You have no idea. Keep in mind that the data traversing the SAN is the really important and thus precious kind in your company. There are technical possibilities to do it, and where there is opportunity, there could be a criminal mind using it. This perception seems to gain more and more acceptance among the switch vendors. For example, Brocade's current 16G equipment is able to have encrypted ISLs for that matter. Of course all vendors sell SAN-based encryption appliances or switches, too. This way not only the inter-location traffic is encrypted, but also the data on the disk or tape. So if there ever was a chance that some unauthorized person got his hands on the storage, he wouldn't be able to read the data.
3.) Fabric access control
What would be the easiest thing to work around passwords and encryption if an intruder would have physical access to a data center? (Just like a student employee, a temp worker, an intern, an external engineer... I think you get the point) He could simply spot a free port on a switch and connect a switch he brought in. Setting up a mirror port or changing the zoning to gain access to disks and doing some other nasty things is quite easy.
How to avoid that?
FICON environments for mainframe traffic always had higher security demands and we can use just the same features for open systems as well. There are security policies allowing us to control which devices are allowed to be connected to the fabric (DCC - device connection control), which switches can be part of the fabric (SCC - switch connection control) and which switches can modify the configuration (FCS - fabric configuration server). In addition the current Brocade FabricOS versions support DH-CHAP and FCAP using certificates for authentication.
If you want to utilize the features and mechanisms described above, the FabricOS Administrator's guide provides some good descriptions and procedures to begin with. Of course IBM offers technical consulting services to help you to secure your SAN properly.
So if you are concerned about whether the provisioning model your IT could be based on in the future is secure, you should be even more concerned about the security of your SAN today!
(Disclaimer: SAN switches from other vendors may have the same or similar security features, too. I just chose Brocade switches because of their prevalence within IBM's SAN customer base.)
The term ecological footprint describes the total impact of someone or something on the environment. To achieve sustainability, this footprint should be kept as low as possible. We should not demand more from Mother Nature than she can provide, and of course we should not demand more than we actually need. Sounds simple, but the reality is way more complex. In the area of IT, the term Green IT was coined to describe and consolidate all the rules, actions and requirements to decrease the ecological footprint for the sake of sustainability. And IBM has a broad agenda on this. But we often forget what each one of us could do to be a little greener.
In the technical support we deal with defects. Our clients have the right to have a product working within the specifications. If a part is working outside its specifications, it has to be repaired or replaced. That's it.
And what's "green" about that?
The impact on nature happens if a part is replaced that was not really broken. No manufacturing process of a part can be so "green-optimized" that it beats simply not replacing a part in good working order. There is the mining (and/or recycling) for the materials, the chemicals and energy used during processing, the packaging, the stocking and of course the logistics, too. In the end, even a small part like a fan can have a huge ecological footprint. This can only be avoided by replacing only the broken part. There's just one problem with that:
What if you can't tell which part is broken?
A classical example of that is a physical error in the SAN. In my article about CRC I pointed out how to use the porterrshow to find physical errors and - even more important - how to find the connection where the physical error is really located. But that's all the data can tell you: you can only track it down to the connection. The connection usually consists of the sending SFP, the cable (plus any additional patch panels and couplers in between), and the receiving SFP. There is no reliable and technically justifiable way to tell which one is the culprit just from the porterrshow. I know that there are some "whitepapers" available on the web stating that this combination of "crc err" and "enc in" means this and that combination of "crc err" and "enc out" means that. But from a technical point of view that's nonsense.
So you have a physical problem, what to do?
When it comes to cables, my fellow IBM blogger Anthony Vandewerdt just released a great article about the impact of dust today. Other reasons for a cable to cause physical problems could be a too small bending radius or loose couplers. In times of fully populated 48- or even 64-port cards, the front side of a SAN director often looks like the back of a hedgehog. With every maintenance action on one of the cables you can expect the CRC error counters of the surrounding ports to increase. So in many situations the cable is not really broken, and replacing it wholesale just because of the counter is not eco-friendly.
The same thing with SFPs. You see physical errors increasing in the porterrshow for a specific port. That could mean that the SFP in there is broken, because its "electric eye" doesn't interpret the (good) incoming signal correctly. It could also mean that the SFP on the other end of the cable is broken, because it sends out a signal in a bad condition. Both will lead to the very same counter increases in porterrshow. If you replace them both as the first action you most probably replaced at least one good one.
Given that you have redundancy in your SAN environment (which you should ALWAYS have), you have free ports available, and the multipath drivers for the hosts using the affected path are working properly, you could track the culprit down by plugging the cable to another SFP in another port and look if the error stays with the port or with the cable.
Please keep in mind that the port address ("the IP address of the SAN") could change along with the port (if you don't have Cisco switches). On Brocade switches you need to do a "portswap" to swap the port addresses as well.
If you cannot touch the other ports, Brocade built some tests into FabricOS for you, like "porttest", "portloopbacktest" and "spinfab". Please have a look into the Command Line Interface Reference Guide for your FabricOS version to get more information about them. With these tests in combination with a so called loopback plug it's easy to find out which part is really broken. Loopback plugs look like the end of a cable but just physically redirect the SFP's TX signal into its RX connector.
Mother Nature will be thankful
There is just one thing from above I want to pick up: parts working within their specification. Not every single CRC error is a reason to replace hardware. According to the Fibre Channel standard, the protocol requires a BER (Bit Error Rate) of at most 10^(-12) to work properly. For 8 or even 16 Gbps that means it's allowed and fully compliant with the FC protocol to have bit errors quite often. Here is where common sense must come into play. If you have 2-digit increases of the CRC error counter within an hour, it might be a good idea to determine which part to replace with the steps mentioned above.
If you see a single CRC from time to time, sometimes with days of no error, sometimes with "some" per day, that's perfectly fine with the FC protocol and well within the specifications. It could lead to single temporary and recoverable errors on a host, but nothing has to be replaced then as long as the rate doesn't increase significantly. You wouldn't replace your one-year-old tires just because the tread is only 90% of what it was when you bought them.
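The spec's allowance can be put into numbers with a small sketch. The assumptions: 8GFC's nominal line rate of 8.5 Gbit/s, sustained full utilization, and a BER sitting exactly at the 10^(-12) limit:

```python
# Worst-case-compliant bit error spacing on a fully utilized 8G link.
# Assumptions: 8.5 Gbit/s line rate, BER exactly at the 1e-12 limit.
ber_limit = 1e-12
line_rate_bps = 8.5e9
seconds_per_bit_error = 1 / (ber_limit * line_rate_bps)
print(f"one bit error every ~{seconds_per_bit_error:.0f} s")
```

So even a link that is fully within spec may show an occasional error every couple of minutes under full load - which is exactly why a handful of CRCs per day is no reason to swap hardware.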
Let's think a little bit greener - even in switch maintenance :o)
I claim that in 2012 performance problems will keep their place amongst the most frequent and most impacting problems in the SAN. In many cases the client's users really notice a performance impact, and so the admin calls for support. Other support cases are opened because of performance-related messages like the ones from Brocade's bottleneckmon or Cisco's slowdrain policy for the Port Monitor. Besides that, there are also cases that don't really look like performance problems at first but turn out to occur for the same reasons. "I/O abort" messages in the device log, link resets, messages about frame drops, failing remote copy links, failing backup jobs or - even worse - failing recoveries: these could all be "performance problems in disguise".
When I analyze the data then and find out that a slow drain device or congestion is the real reason for the problem I write my findings down and try to give the client some hints about possible next steps. For example by mentioning my earlier blog article about How to deal with slow drain devices.
Do you know what's mean about it?
Often clients have never heard of slow drain devices before. Longtime storage administrators are confronted with a term that sounds like a support guy made it up to point the finger at another vendor's product. Of course I usually explain what it is and what it means for the fabric and for the connected devices. But to be honest, I would be sceptical, too. I would go to the next search engine and query "slow drain device". The first hits are from this blog and from the Brocade community pages, and there are some questions about that topic. Considering the substance of posts in public forums, I would check Brocade's own SAN glossary. Guess what? Not a word about slow drain devices - which is no surprise, as it's from 2008. I would check Wikipedia. Nothing. My fellow blogger Archie Hendryx mentioned that it's missing in the SNIA dictionary, too. And he's right: Nothing!
So why is that so?
Why are the terms "HTML" and "export" explained in the dictionary of the Storage Networking Industry Association, but there is not a single appearance of the term "slow drain device" on the complete SNIA website (according to their built-in search function)? Well, I don't know, but of course we can change that. The SNIA dictionary makers are asking for contributions, so if you have a term that has a meaning in the storage industry, feel free to send them a definition for the next release. I thought about doing that for some of the SAN performance-related terms I didn't find in the dictionary. Below you'll find some definitions that I wrote. But I'm not infallible, and therefore I would like to have an open discussion about them. Let me know what you think about them. Let me know if your understanding of a term (used in the area of SAN performance, of course) differs from mine. Let me know if my wording hurts the ears of native English speakers. Let me know if you have a better definition. Let me know if there are important terms missing. And let me know if you think that a term is not really so generally used or important that it should appear in the SNIA dictionary - side by side with sophisticated terms like Tebibyte :o).
slow drain device - a device that cannot cope with the incoming traffic in a timely manner.
Slow drain devices can't free up their internal frame buffers and therefore don't allow the connected port to regain its buffer credits quickly enough.
congestion - a situation where the workload for a link exceeds its actual usable bandwidth.
Congestion happens due to overutilization or oversubscription.
buffer credit starvation - a situation where a transmitting port runs out of buffer credits and therefore isn't allowed to send frames.
The frames will be stored within the sending device, blocking buffers and eventually have to be dropped if they can't be sent for a certain time (usually 500ms).
back pressure - a knock-on effect that spreads buffer credit starvation into a switched fabric starting from a slow drain device.
Because of this effect a slow drain device can affect apparently unrelated devices.
bottleneck - a link or component that is not able to transport all frames directed to or through it in a timely manner. (e.g. because of buffer credit starvation or congestion)
Bottlenecks increase the latency or even cause frame drops and upper-level error recovery.
Feel free to use the comment feature here or tweet your thoughts with hashtag #SANperfdef. If you add @Zyrober in the tweet, I'll even get a mail :o)
I updated the definitions with an additional sentence. Feel free to comment.
I haven't blogged for a while because of an internal project. Like every software development project it's never really finished, and development will go on over the next years to bring in new functions, but I hope I'll have some more time for blogging again now. :o) I also decided to move away from the long blog posts I did in the past towards more conveniently readable short posts where possible.
Long distance modes
Brocade has basically 3 long distance modes:
- LE mode - merges all user-data virtual channels and assigns the amount of buffers necessary to cover a 10 km distance based on the full frame size for the given speed. It requires no license.
- LS mode - like LE mode, but is used for distances > 10 km and requires the "Extended Fabric License". You configure it with a fixed distance.
- LD mode - similar to LS mode, but the distance is measured automatically and the buffers are assigned according to the measured distance. You configure it with a "desired distance".
So what's the problem with LD?
If you have two data centers with a distance of 30 km between them and you configure a desired distance of 60 km, the switch will still only assign the buffers for the measured 30 km. Increasing the desired distance doesn't change anything.
Wait! Why should I increase it anyway?
As written above, the number of buffers depends on the distance. The switch simply calculates the number of buffers from the number of full-sized frames (frames with the maximum frame size - usually 2 kB) needed to span the distance. But the problem is: in real life the average frame size is actually much smaller than the maximum one.
In the picture above you see a write I/O from a fibre channel trace. The lines with the pink background are the frames from the host, the ones with the gray background are the responses from the storage. The last column shows the size of each frame. Only the 4 data frames have the full frame size; the other 3 frames are far smaller than 2 kB. So the average frame size in this example is just 1.2 kB. With this average frame size you would need almost double the number of buffers to fill the link compared to what the switch calculated! And it can be much worse. I ran a report over the full trace, and the average frame size for the transmit and receive traffic was:
Given those numbers, and adding a little buffer reserve, you would need 3 times the buffers the switch would use!
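To see why the average frame size matters so much, here is a minimal back-of-the-envelope sketch in Python. It is illustrative only and not Brocade's exact internal formula; it assumes 8b/10b encoding (10 bits per byte on the wire), roughly 5 µs of one-way latency per km of fibre, and the 8.5 Gbaud line rate of 8G FC:

```python
import math

# Rough estimate of the buffer credits needed to keep a long
# distance link constantly filled: the round-trip time divided
# by the serialization time of one frame.
# Simplified sketch - not Brocade's exact internal formula.

def credits_needed(distance_km, line_rate_gbaud, frame_bytes):
    frame_time = frame_bytes * 10 / (line_rate_gbaud * 1e9)  # 8b/10b: 10 bits per byte
    round_trip = 2 * distance_km * 5e-6                      # ~5 us per km, both ways
    return math.ceil(round_trip / frame_time)

# 30 km at 8G (8.5 Gbaud line rate):
full_size = credits_needed(30, 8.5, 2112)  # what the switch assumes (full frames)
measured  = credits_needed(30, 8.5, 1200)  # with the 1.2 kB average from the trace
print(full_size, measured)                 # 121 vs. 213 - almost double
```

With the 1.2 kB average from the trace, the link needs almost twice the credits the switch calculated for full-sized frames, which is exactly the effect described above.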
Okay so let's give it more buffers!
Yes, for LS mode this would be exactly the action plan. But remember: for LD mode, the switch just uses the measured distance. The desired distance is only used as an additional maximum. So if you have 30 km and configure 20 km, it will only assign the buffers for 20 km. If you configure 50 km, it will only assign the buffers for 30 km. So my general recommendation is:
Use LS instead of LD!
LS mode gives you full control. Use it with enough buffers by configuring a multiple of the physical distance. 3x is good practice, but you can increase it even more if there are buffers left. You can always check the available buffers with the command "portbuffershow".
Don't leave those lazy buffers unassigned but use them to fill your links!
In one of my previous posts I wrote about "Why inter-node traffic across ISLs should be avoided". There is an additional "bad practice" that could lead to performance problems in the host-to-SVC traffic.
Let's imagine a core-edge fabric. A powerful switch (or director) at its center is the core. The SVC and its backend storage subsystems are directly connected to it. Besides that, there are also the ISLs to the edge switches where the hosts are connected. As there is an SVC in the fabric, all host traffic usually goes to the SVC, and the SVC is the only host of all other storages. From time to time I see cabling like the one below. The devices are connected in a common pattern: for example, SVC ports are always on ports 0, 4, 8, ..., or on a director for example on ports 0 and 16 of each card. Something like that. The reason behind this is often to spread the workload over several cards/ASICs to minimize the impact of a hardware failure. But there's a risk in doing so.
Index Port Address Media Speed State Proto
0 0 190000 id 8G Online FC F-Port 50:05:07:68:01:40:a2:18
1 1 190100 id 8G Online FC F-Port 20:14:00:a0:b8:11:4f:1e
2 2 190200 id 8G Online FC F-Port 20:16:00:80:e5:17:cc:9e
3 3 190300 id 8G Online FC E-Port 10:00:00:05:1e:0f:75:be "fcsw2_102" (downstream)
4 4 190400 id 8G Online FC F-Port 50:05:07:68:01:40:06:36
5 5 190500 id 8G Online FC F-Port 20:04:00:a0:b8:0f:bf:6f
6 6 190600 id 8G Online FC F-Port 20:16:00:a0:b8:11:37:a2
7 7 190700 id 8G Online FC E-Port 10:00:00:05:1e:34:78:38 "fcsw2_92" (downstream)
8 8 190800 id 8G Online FC F-Port 50:05:07:68:01:40:05:d3
The SAN perspective
In the situation described above, all host traffic passes the ISLs from the edge switches to the core. ISLs are logically "partitioned" into so-called virtual channels. Of course the ISL is still just one fibre, and only one signal passes it physically at a time. The virtual channels are just dedicated portions of the buffer credits, and the decision which virtual channel a frame takes - and therefore which portion of the buffer credits it uses - is made by looking at the destination fibre channel address.
Technical deep dive
A normal non-QoS ISL has 4 virtual channels for data traffic. On an 8G link each of them has 5 buffers. They can only work with these 5 buffers; there is no way to "borrow" some from a common pool as with QoS links. With the command "portregshow" you can see the buffer credits assigned to the virtual channels (I added the first line):
VC 0 1 2 3 4 5 6 7
0xe6692400: bbc_trc 4 0 5 5 5 5 1 1
Only VCs 2-5 are used for data traffic. This makes 20 usable buffers, which is normally enough for a typical multimode connection between two switches in the same room with only a few metres of cable. Basically, the switch uses the last two bits of the second byte of the destination address. It looks like this:
Bits 00 -> frame uses VC 2 (which is the first virtual channel for data)
Bits 01 -> frame uses VC 3
Bits 10 -> frame uses VC 4
Bits 11 -> frame uses VC 5
So where's the problem now?
In our imaginary core-edge fabric, where for example all SVC ports are connected to ports 0 (bin 00), 4 (bin 100), 8 (bin 1000), 12 (bin 1100), ..., all host I/O towards the SVC would use the same virtual channel. As this is the only traffic passing the ISLs from the edges to the core, only a quarter of the buffers are actually used! 5 buffers are very heavily used while 15 idle around, never to be filled. And 5 buffers are pretty few when an edge switch full of hosts wants to talk to the core switch where the SVC is connected. The result would be credit starvation and congestion on a virtual channel level.
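The mapping can be sketched in a few lines of Python - an illustrative model of the rule described above, not actual switch code. The addresses are taken from the switchshow output earlier:

```python
# Which data VC a frame takes on a normal (non-QoS) Brocade ISL:
# the last two bits of the area byte (the second byte of the
# 24-bit destination address) select one of VCs 2-5.
# Sketch of the mapping described above, not actual switch code.

def data_vc(dest_id):
    area = (dest_id >> 8) & 0xFF   # second byte of the FC address
    return 2 + (area & 0b11)       # last two bits -> VC 2..5

# All SVC ports connected to ports 0, 4, 8, 12 of domain 0x19:
for port in (0x00, 0x04, 0x08, 0x0C):
    print(hex(0x190000 | port << 8), "-> VC", data_vc(0x190000 | port << 8))
```

Every one of these addresses ends up on VC 2, which is exactly the "quarter of the buffers" problem: the other three data VCs never see a frame.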
How to solve that?
There are 3 possibilities:
1.) You could re-cable your SAN in a manner that uses all VCs. But besides the risk of physical problems and problems introduced by maintenance actions, the devices have to learn the new addresses of the SVC ports. For many operating systems this still means reboots or reconfigurations. It could involve a lot of work and risk of outages.
2.) You could just change the addresses with the portaddress command. This command is usually used in virtual fabric environments, and whether you can use it depends on the installed firmware and the platform. While it avoids the physical actions, it still has the same disadvantages for the hosts because of the changed addresses.
3.) The best and least disruptive possibility might be to set the ISLs to LE mode. This is the long distance mode dedicated to links under 10 km in length. It will not only put more buffers on the link (40 for user traffic on an 8G link compared to the 20 of a normal 8G E-Port) but will also collapse the 4 user traffic VCs into just one. It then looks like this:
VC 0 1 2 3 4 5 6 7
0xe6602400: bbc_trc 4 0 40 0 0 0 1 1
So all buffers, and therefore also all buffer credits, will be used by the hosts and nothing idles. There will of course be a short interruption while changing the ISL to LE mode, but apart from that nothing changes for the hosts, because all the addresses stay the same. This is clearly the way to go in the situation described above.
One oddity to close with: some switches are delivered from manufacturing with an alternative addressing pattern. For example, port 1 of domain 3 then won't have the address 030100 but something like 030d00. In that case the problem can occur similarly, just on other ports. But using LE mode would solve it in much the same way.
Please keep in mind that the whole article relates to a very special (although very common) SAN layout in an SVC-centered environment. This is clearly not a standard action plan for all performance problems but it could help if you have a customer in a situation like this. For any questions, feel free to contact me.
Additionally, please be aware that this is not an SVC problem in itself; it will happen with any central storage connected to a switch using a pattern as described above, when used by hosts connected to another switch over an ISL!
Update from May 9th:
I was made aware that readers of this article queried their vendors, maintenance providers or business partners with the idea to just set all their ISLs to LE mode, regardless of whether the condition described above is actually met. Because of that, I would like to state more clearly: using LE mode as a general approach for your ISLs can cause severe problems!
If the SVC ports are not connected in a way that only one virtual channel is used, it actually makes sense to have ISLs with more than one VC. Virtual channels are a good feature to prevent a latency bottleneck due to back pressure from impairing the traffic of all devices using the same ISL. If devices on the edge switches also communicate with devices connected to other ports of the core (or to other edges), the impact of using LE mode would be even more extreme in the case of slow drain devices.
I made some drawings to illustrate this. The first one shows 1 normal ISL between the edge and the core. You can see the 4 VCs used for data traffic. (I left out the other VCs for better visibility):
Here hosts 1 and 2 run traffic against the SVC (green), host 3 against an additional disk subsystem (purple) and host 4 against a tape drive (orange). Based on the ports these devices are connected to, different VCs are used for that traffic.
If you would use an LE-port instead, it would look like this:
Now all 4 data traffic VCs have collapsed into a single one. As long as everything runs smoothly, you won't see an impact.
But if, for example, one of the devices connected to the core is slow draining, the following will most probably happen:
In the picture above the purple disk is a slow drain device. Due to back pressure the whole ISL becomes a latency bottleneck, because all data traffic shares the same VC in LE mode. The back pressure propagates further towards the edge switch, and all 4 hosts of our example are now affected, although only host 3 communicates with the slow drain device!
With a normal E-port it looks like this:
Now only VC 4 is affected while VCs 2, 3 and 5 run smoothly, because they have their own, unaffected buffer management. Therefore only host 3 will face a performance problem while hosts 1, 2 and 4 run fine.
You see: using LE mode for the purpose described in my original article only makes sense if these special conditions are really met. In all other cases it can impair the SAN performance tremendously!
I was asked where to look in a switch to find the average frame size for a port. The safest way would be to use an external monitoring tool like VirtualWisdom or a tracer as described in my LD mode article, but if you don't own something like that, you can get a good estimate from the switches themselves. You just have to calculate it from the number of frames and the number of bytes transferred.
For Cisco it's easy. Just look at the "show interface" output for the specific port and you'll find both numbers in the statistics section for each interface:
1887012 frames input, 1300631486 bytes
542470 frames output, 482780325 bytes
So we can just calculate the average frame sizes for both directions:
1300631486 bytes / 1887012 frames = 689 bytes per frame
482780325 bytes / 542470 frames = 890 bytes per frame
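As a small sketch, the same division in Python, using the counter values from the "show interface" output above:

```python
# Average frame size from Cisco "show interface" counters:
# plain bytes divided by the frame count, per direction.

def avg_frame_size(byte_count, frame_count):
    return round(byte_count / frame_count)

rx = avg_frame_size(1300631486, 1887012)  # input direction
tx = avg_frame_size(482780325, 542470)    # output direction
print(rx, tx)                             # 689 and 890 bytes per frame
```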
For Brocade switches you can get the information out of the portstatsshow command:
stat_wtx 35481072 4-byte words transmitted
stat_wrx 70173758 4-byte words received
stat_ftx 1111087 Frames transmitted
stat_frx 1177665 Frames received
Here we don't have plain bytes but 4-byte words. Don't worry - fillwords don't count into this number, so it's still valid for our calculation. We just have to multiply by four to use it:
(35481072 * 4) bytes / 1111087 frames = 128 bytes per frame
(70173758 * 4) bytes / 1177665 frames = 238 bytes per frame
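The Brocade variant in Python, with the word-to-byte conversion spelled out (values from the portstatsshow output above):

```python
# Average frame size from Brocade portstatsshow counters:
# the stat_wtx/stat_wrx values are 4-byte words, so multiply
# by 4 to get bytes before dividing by the frame count.

def avg_frame_size(word_count, frame_count):
    return round(word_count * 4 / frame_count)

tx = avg_frame_size(35481072, 1111087)  # transmitted
rx = avg_frame_size(70173758, 1177665)  # received
print(tx, rx)                           # 128 and 238 bytes per frame
```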
It's really that easy?
Basically yes. With this average frame size you can work out the multiplier for the buffer credit settings. So if you have an average frame size of 520 bytes and a link of 30 km, just calculate:
2112 (the max frame size) / 520 ≈ 4
So you would set up the link for 120 km instead of 30 km to reserve a sufficient amount of buffers. That's it.
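The same rule of thumb as a Python sketch (rounding the ratio of maximum to average frame size to a whole multiplier, as in the example above):

```python
# Turn a measured average frame size into an LS-mode distance
# setting: scale the physical distance by the ratio of the
# maximum frame size (2112 bytes) to the average frame size.

def ls_distance(physical_km, avg_frame_bytes, max_frame_bytes=2112):
    multiplier = round(max_frame_bytes / avg_frame_bytes)
    return physical_km * multiplier

print(ls_distance(30, 520))   # 120 -> configure the 30 km link as 120 km
```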
A last catch
If you read my article about bottleneckmon, you probably already know that we are working with 32 bit counters here. While they cover a few hours for the frames, they wrap much more quickly for the 4-byte words. So to calculate an average frame size over several hours or days, 32 bit counters are not enough. There are actually 64 bit counters for these values in the switches - although they are not part of a supportsave. The command portstats64show provides them. One thing to keep in mind: while in the latest FabOS versions a statsclear resets these counters as well, in older versions you had to reset them with portstatsclear.
The 64 bit counters are actually two 32 bit counters, and the lower one ("bottom_int") is the 32 bit counter we used all the time in portstatsshow. But each time it wraps, it increases the upper one ("top_int") by 1. So after a while you might see a portstats64show output like this:
stat64_wtx 0 top_int : 4-byte words transmitted
2308091032 bottom_int : 4-byte words transmitted
stat64_wrx 39 top_int : 4-byte words received
1398223743 bottom_int : 4-byte words received
stat64_ftx 0 top_int : Frames transmitted
9567522 bottom_int : Frames transmitted
stat64_frx 0 top_int : Frames received
745125912 bottom_int : Frames received
For the received frames it's then:
(2^32 * 39 + 1398223743) * 4 bytes / 745125912 frames = 907 bytes per frame.
Quite a lot of manual computing, hmm?
Of course you could write a script for that or prepare a spreadsheet, but my recommendation is still to start with a multiplier of 3 for normal open systems traffic and check with the command portbuffershow how many buffers are still available. And if you still have some, use them - but keep them in mind if you connect additional long distance ISLs or devices that you want to give additional buffers as well.
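Such a script could look like the following sketch, which rebuilds the 64 bit value from the top_int/bottom_int pair and repeats the receive-side calculation from above:

```python
# Reconstruct a 64-bit counter from portstats64show's
# top_int/bottom_int pair: the top half counts wraps of the
# 32-bit bottom half, so shift it up by 32 bits and add.

def combine64(top_int, bottom_int):
    return (top_int << 32) + bottom_int

words_rx  = combine64(39, 1398223743)  # stat64_wrx from the output above
frames_rx = combine64(0, 745125912)    # stat64_frx from the output above

print(round(words_rx * 4 / frames_rx))  # 907 bytes per frame, as calculated above
```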
Update Nov. 2nd 2012:
I was made aware that there is an easier and much more convenient way to use portstats64show: Just use the -long option.
pfe_ODD_B40_25:root> portstats64show 26
stat64_wtx 7 top_int : 4-byte words transmitted
485794041 bottom_int : 4-byte words transmitted
stat64_wrx 13 top_int : 4-byte words received
2521709207 bottom_int : 4-byte words received
pfe_ODD_B40_25:root> portstats64show 26 -long
stat64_wtx 30557972957 4-byte words transmitted
stat64_wrx 58371265974 4-byte words received
Much better, isn't it? Thanks to Martin Lonkwitz!