Comments (4)

1 scott.fuhrman commented

Good post on the war against slow-draining devices. We implemented short edge hold timers on edge switches as soon as the feature was available and have seen good success against the rogue slow-draining device acting up in the middle of the night. It doesn't eliminate the root cause, but it can certainly limit the impact until the real issue can be addressed. Always better to impact one device than an entire fabric.

I like the resilient fabric paper, but I feel the recommended starting timer values are too conservative for large fabrics. We use strict values (80ms) on edge switches/directors and 220ms on core directors. With FOS 7, it has changed to a low (80ms) / medium (220ms) / high (500ms) setting. I figure that even with the low setting, if it is taking 80ms to return credits, something is definitely wrong.

One note to add about enabling portcfgfportbuffers: it does reset the link when enabled, so be aware of this and pick a good time to do it.
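
(A quick back-of-the-envelope check of that 80ms figure, as a minimal sketch with assumed nominal values: roughly 2148 bytes for a full-size FC frame and roughly 800 MB/s of usable throughput on an 8 Gbps link. It shows why even the "low" hold time is orders of magnitude beyond a healthy credit turnaround.)

    # Rough arithmetic only -- assumed nominal values, not measured data.
    frame_bytes = 2148         # ~max FC frame size incl. headers/CRC
    line_rate_bytes_s = 800e6  # ~usable payload rate of an 8 Gbps FC link

    # Time to serialize one full-size frame onto the link, in microseconds.
    frame_time_us = frame_bytes / line_rate_bytes_s * 1e6
    print(f"one full-size frame at 8G: {frame_time_us:.1f} us")

    # FOS 7 low / medium / high edge hold time settings, in milliseconds.
    for hold_ms in (80, 220, 500):
        ratio = hold_ms * 1000 / frame_time_us
        print(f"{hold_ms} ms hold time is ~{ratio:,.0f}x one frame time")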

2 seb_ commented

Hi Scott, thanks for your feedback. Strange... for the switches I checked, the minimum EHT was 100ms. But maybe that changed in other FOS versions. Good to see that somebody really uses it to make the fabric more robust in case of back pressure due to a slow drain device.

Good point about portcfgfportbuffers, too. I have to admit that while I would definitely put that statement in an action plan for a case, I seem to leave out such information in the blog. Basically everything in here is just "food for thought", but you are absolutely right, I should have mentioned that it's disruptive for the link. Changing the buffer credits is a major change to the link configuration and requires a complete new relogin to become effective. So only do this in a controlled manner and if you can rely on your configuration and multipath drivers - or do it in a maintenance window.

3 dlutz commented

One of my work colleagues referred this article to me.

Not sure I agree with your point 5 about frames backing up to the appliance. Outbound congestion shouldn't really cause congestion for the inbound traffic. I suppose it would depend on the appliance, but since you referenced SVC with its large inbound buffer of 40 credits, I don't see it happening.

Instead I see that when the SVC gets congested, as you very well described, it will start to queue up outbound workload internally, and that workload would be for all hosts using the SVC. So their I/O response times would go up, possibly to the point where they start to see I/O timeouts.

I too like the reduction of the edge hold time, but of course we really want to prevent the congestion before it gets so bad that we are throwing away frames and forcing retries.

I have only seen limited success with increasing the number of port buffers where the SVC connects (we typically recommend 40); if the congestion gets that high, then more buffers aren't likely going to let you ride it out.

One of our recommendations is to add more ISL links between the switches. Fabrics are getting so big and fast that interswitch connections are no longer just about bandwidth but about frame flow. With exchange-based routing, the more links the better, but we always recommend using 2-link trunks for the ISLs for availability reasons.

I have also seen some good results with port fencing, but you need to be willing to sacrifice the "bad actor" for the greater good.

After that, bottleneck detection and Fabric Watch monitoring (for high link utilization) are the next best defense. Another metric you could monitor is the SVC buffer-to-buffer time zero counts (bbtzc) via TPC.
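
(To put a rough number, as a sketch with assumed values, on why extra buffers rarely let you "ride it out": 40 credits of full-size frames is only about 84 KiB, which an 8 Gbps link fills or drains in roughly a tenth of a millisecond.)

    # Rough arithmetic only -- assumed values, not a measurement.
    credits = 40               # typical recommendation for the SVC F-port
    frame_bytes = 2148         # ~max FC frame size
    line_rate_bytes_s = 800e6  # ~usable payload rate of an 8 Gbps FC link

    buffered_bytes = credits * frame_bytes
    drain_ms = buffered_bytes / line_rate_bytes_s * 1000
    print(f"buffered data: {buffered_bytes / 1024:.0f} KiB")
    print(f"filled/drained at line rate in: {drain_ms:.2f} ms")
    # ~84 KiB and ~0.11 ms: helps with micro-bursts, not with congestion
    # that lasts tens of milliseconds or longer.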

4 seb_ commented

Thanks for your feedback. I'm not sure we are on the same wavelength for point 5. It's just back pressure after the appliance has turned into a slow drain device itself. Not "too many frames" for the physical link (= congestion), but credit starvation. Unfortunately I have seen that in several cases. (Like almost all the problems I describe here. In tech support you see a lot of stuff nobody thought could happen in real life.) I agree with the other points you make. The credit stats inside the SVC are very helpful in performance analysis. Seeing both sides of the vital SVC links helps a lot.
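
(A tiny toy model, not any switch's actual scheduling, may help illustrate the credit-starvation back pressure: once one device stops returning credits, the ISL receive buffers on its switch fill up with frames destined to it, the ISL credits on the upstream side run out, and traffic for perfectly healthy devices can no longer get across.)

    # Toy model of credit-starvation back pressure -- illustrative only,
    # not how FOS actually schedules frames.
    ISL_CREDITS = 8           # receive buffers on switch B's ISL port
    slow_dev_credits = 0      # slow drain device stopped returning credits
    good_dev_credits = 10**6  # healthy device always has room

    isl_rx_buffer = []        # frames parked in B's ISL receive buffers
    isl_credits_at_A = ISL_CREDITS
    blocked_good_frames = 0   # healthy traffic that could not enter the ISL

    for tick in range(100):
        # Switch A alternates frames for the slow and the healthy device.
        dest = "slow" if tick % 2 == 0 else "good"
        if isl_credits_at_A > 0:
            isl_rx_buffer.append(dest)
            isl_credits_at_A -= 1
        elif dest == "good":
            blocked_good_frames += 1

        # Switch B forwards every frame whose destination has a free credit
        # and returns one ISL credit per freed receive buffer.
        still_parked = []
        for frame in isl_rx_buffer:
            if frame == "good" and good_dev_credits > 0:
                good_dev_credits -= 1
                isl_credits_at_A += 1
            elif frame == "slow" and slow_dev_credits > 0:
                slow_dev_credits -= 1
                isl_credits_at_A += 1
            else:
                still_parked.append(frame)  # stuck waiting for the slow device
        isl_rx_buffer = still_parked

    print("frames stuck in the ISL buffers:", len(isl_rx_buffer))
    print("frames for the healthy device blocked at switch A:", blocked_good_frames)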