Reach this article at https://ibm.biz/stuckvc.
It's summertime again, and for some of our customers that means it's time to do their Fabric OS updates. Maybe you want to do that, too? I personally recommend a six-month interval for moving to the latest code or the latest "mature" code, depending on your policy.
When you update to one of the latest v6.3x, v6.4x or v7x codes, you might see your switch error log flooded with a new error message after the update:
2012/06/12-07:01:34, [CDR-1011], 1001, SLOT 6 | CHASSIS, WARNING, M48Fab1, S5,P-1(35): Link Timeout on internal port ftx=10203920 tov=2000 (>1000) vc_no=16 crd(s)lost=3 complete_loss:1
This was for a 2109-M48 (Brocade 48000) with a Condor ASIC. For a DCX with Condor2 ASICs it would look like this:
2012/06/12-10:45:11, [C2-1012], 9482, SLOT 7 | CHASSIS, WARNING, DCXFab1, S1,P-1(3): Link Timeout on internal port ftx=39298539 tov=2000 (>1000) vc_no=16 crd(s)lost=1 complete_loss:1
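If the log is flooded, it can help to pull the interesting fields out of these messages and count them per internal port. The following is only a minimal sketch: the regular expression is modeled on the two sample messages above and may need adjusting for other code levels, and the field names (location, vc_no, credits_lost and so on) are my own labels, not official Brocade terminology.

```python
import re
from collections import Counter

# Pattern modeled on the two sample messages in this article (assumption:
# other firmware levels may format the line differently).
LINK_TIMEOUT = re.compile(
    r"(?P<timestamp>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}), "
    r"\[(?P<msg_id>CDR-1011|C2-1012)\], (?P<seq>\d+), "
    r"SLOT \d+ \| CHASSIS, (?P<severity>\w+), (?P<switch>\S+), "
    r"(?P<location>S\d+,P-?\d+\(\d+\)): "
    r"Link Timeout on internal port .*?"
    r"vc_no=(?P<vc_no>\d+) crd\(s\)lost=(?P<credits_lost>\d+) "
    r"complete_loss:(?P<complete_loss>\d+)"
)


def parse_link_timeouts(lines):
    """Return one dict per line that looks like a Link Timeout message."""
    events = []
    for line in lines:
        match = LINK_TIMEOUT.search(line)
        if match is None:
            continue  # not a stuck-VC message, ignore it
        event = match.groupdict()
        for key in ("seq", "vc_no", "credits_lost", "complete_loss"):
            event[key] = int(event[key])
        events.append(event)
    return events


def count_by_location(events):
    """Count messages per (switch, internal port location, VC number)."""
    return Counter((e["switch"], e["location"], e["vc_no"]) for e in events)


SAMPLE_LOG = [
    "2012/06/12-07:01:34, [CDR-1011], 1001, SLOT 6 | CHASSIS, WARNING, M48Fab1, "
    "S5,P-1(35): Link Timeout on internal port ftx=10203920 tov=2000 (>1000) "
    "vc_no=16 crd(s)lost=3 complete_loss:1",
    "2012/06/12-10:45:11, [C2-1012], 9482, SLOT 7 | CHASSIS, WARNING, DCXFab1, "
    "S1,P-1(3): Link Timeout on internal port ftx=39298539 tov=2000 (>1000) "
    "vc_no=16 crd(s)lost=1 complete_loss:1",
]

if __name__ == "__main__":
    for key, count in count_by_location(parse_link_timeouts(SAMPLE_LOG)).items():
        print(key, count)
```

Feeding it a saved copy of the error log, one line per list entry, shows quickly whether the messages all point at the same backlink and VC or at several of them.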
Did the update break something?
No. Brocade just implemented a check for "stuck VCs", and it found one in your director. The problem was there before the update; the new Fabric OS is simply able to detect it and generates a warning message about it.
What is a stuck VC?
I explained VCs (Virtual Channels) a bit in the updated version of my article about "How to NOT connect an SVC in a core-edge Brocade fabric" and the one about Quality of Service. As I wrote there, each VC has its own buffer management: its own buffer credit counter and special VC-related 4-byte words (VC_RDYs) that refill only the buffer credits of a certain VC. A normal link to a device usually has only one buffer credit management. If buffer credits are lost there over time, performance usually decreases until the last buffer credit is lost. Then a link reset is issued after 2 seconds to regain the credits. Internal backlinks between cards in a director can lose buffer credits, too. But as they can only lose a buffer credit belonging to a specific VC, the other VCs may still have buffer credits. So while the other VCs continue to run without any problems, only the VC that lost its credits is affected. It's a so-called "stuck VC" now.
Wait! How can buffer credits be lost?
There are several possible causes, but I think the likeliest and most understandable one is a bit error corrupting the VC_RDY. If a bit is flipped in the VC_RDY, the receiving port can no longer recognize it and the credit is lost. But "a few" bit errors are acceptable even in the Fibre Channel protocol, so this can happen even if everything works within the specs. The important thing is to detect it and react properly.
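To make the mechanism a bit more tangible, here is a toy model of the per-VC credit accounting described above. It is only a sketch under simplifying assumptions: the class name, the number of VCs and the credits per VC are made up for illustration, and the real ASIC logic is certainly more involved.

```python
class VirtualChannelLink:
    """Toy sender-side credit accounting for one link with several VCs."""

    def __init__(self, vc_count, credits_per_vc):
        self.credits_per_vc = credits_per_vc
        self.credits = {vc: credits_per_vc for vc in range(vc_count)}

    def send_frame(self, vc):
        """Consume one credit of this VC; return False if none is left."""
        if self.credits[vc] == 0:
            return False
        self.credits[vc] -= 1
        return True

    def receive_vc_rdy(self, vc, corrupted=False):
        """A VC_RDY gives one credit back to its VC, unless a bit error
        made it unrecognizable. In that case the credit is simply lost."""
        if not corrupted:
            self.credits[vc] += 1

    def stuck_vcs(self):
        """VCs that have no credits left and therefore cannot send."""
        return [vc for vc, count in self.credits.items() if count == 0]

    def link_reset(self):
        """A link reset restores the full credit count on every VC."""
        for vc in self.credits:
            self.credits[vc] = self.credits_per_vc


if __name__ == "__main__":
    link = VirtualChannelLink(vc_count=4, credits_per_vc=3)
    # Send three rounds of frames; every VC_RDY for VC 2 gets corrupted.
    for _ in range(3):
        for vc in range(4):
            link.send_frame(vc)
            link.receive_vc_rdy(vc, corrupted=(vc == 2))
    print("credits:", link.credits)
    print("stuck VCs:", link.stuck_vcs())
    link.link_reset()
    print("after link reset:", link.credits)
```

In this model VC 2 ends up with zero credits and can no longer send, while the other three VCs keep their full credit count. Only the link reset brings VC 2 back.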
So I get these new messages and they tell me I have a problem. What now?
With FabOS v6.4.2a (and v6.3.2d, v7.0.0) Brocade extended the bottleneckmon command with an additional agent. This agent reacts to stuck VC conditions by doing a link reset on the specific backlink. This is a big improvement over the older codes, where a stuck VC on an internal link between two blades required you to reseat one of the blades or to power it off for a moment.
But it's disabled by default and you have to switch it on!
To enable it, run:
bottleneckmon --cfgcredittools -intport -recover onLrOnly
Once enabled, the agent monitors the internal links. If there is a 2-second window without any traffic on a backlink with a stuck VC, the agent resets that link to clear the stuck VC. This approach minimizes the impact of the link reset. You might still see a few aborts in the host logs, but these are usually self-recoverable. After that the messages should stop and you can use the full internal bandwidth of your switch again.
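The idea of waiting for a quiet moment can be illustrated with a few lines of code. This is not how Fabric OS implements it; it is just a sketch of the "wait for a 2-second window without traffic" rule, with a hypothetical function name and made-up frame timestamps, and it assumes no traffic arrives after the last listed frame.

```python
def quiet_window_end(frame_times, monitoring_start, quiet_window=2.0):
    """Return the earliest time at which quiet_window seconds have passed
    without any frame, counted from monitoring_start.

    frame_times are timestamps (in seconds) of traffic on the backlink.
    """
    last_activity = monitoring_start
    for t in sorted(frame_times):
        if t < monitoring_start:
            continue  # traffic before monitoring started does not count
        if t - last_activity >= quiet_window:
            break  # the gap before this frame was already long enough
        last_activity = t
    return last_activity + quiet_window


if __name__ == "__main__":
    # Made-up traffic pattern: busy until 3.0 s, then idle until 6.0 s.
    frames = [0.5, 1.0, 2.5, 3.0, 6.0, 6.2]
    print(quiet_window_end(frames, monitoring_start=0.0))
```

With this made-up traffic pattern the quiet window is complete at 5.0 seconds, which is the point where a reset would disturb the fewest frames in this simplified picture.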
Please have a look into the help page of the bottleneckmon command ("help bottleneckmon") for more information. And if you still get messages pointing to lost credits, please open a case and we'll have a look.