Plumbers define a slow drain as the thing you get when your teenage kid has tried to wash a clump of hair vaguely resembling a tribble down the pipes.
Ed Mazurek from Cisco TAC defines it quite differently:
…When there are slow devices attached to the fabric, the end devices do not accept the frames at the configured or negotiated rate. These slow devices, referred to as “slow drain” devices, lead to ISL credit shortage in the traffic destined for these devices and they can congest significant parts of the fabric.
Having fought this fight (and being that I’m *STILL* fighting this fight), I can say quite simply that when you have a slow-drain device on your fabric, you have a potential land mine…
My (non-TAC) take is this:
Say I have the following configuration:
HostA is a blade server with 2x 4G HBAs
StorageA is an EMC VMAX with 8G FA ports.
HostA is zoned to StorageA with a 1:2 ratio (1 HBA to 2 FAs).
Now, when HostA requests storage from StorageA (for instance, say you have an evil software/host-based replication package, which shall remain nameless, that likes to do reads and writes in 256K blocks), StorageA is going to assemble and transmit the data as fast as it can. *IF* you have 32G of storage bandwidth staring down the barrel of 8G of host bandwidth, the host might not be able to accept the data as fast as the storage is sending it. The host has to tell the switch that it’s ready to receive data (Receive-Ready, or “R_RDY”). The switch tries to hold on to the data as long as it can, but there are only so many buffer credits on a switch. If the switch runs out of buffer credits, it can affect other hosts on that switch, or if it’s bad enough, even across the fabric.
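To put rough numbers on that mismatch, here’s the arithmetic for the example configuration above; it’s just line rates, nothing measured:

```python
# Back-of-the-envelope oversubscription math for the example config.
hba_count = 2          # HostA: 2x 4G HBAs
hba_gbps = 4
fas_per_hba = 2        # 1:2 zoning: each HBA sees 2 FAs
fa_gbps = 8            # VMAX 8G FA ports

host_bw = hba_count * hba_gbps                  # 8G the host can drain
storage_bw = hba_count * fas_per_hba * fa_gbps  # 32G the array can push

print(f"host can drain  : {host_bw}G")
print(f"array can push  : {storage_bw}G")
print(f"oversubscription: {storage_bw // host_bw}:1")
# Every frame beyond what the host acknowledges with R_RDY sits in
# switch buffers, burning B2B credits until there are none left.
```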
So it’s possible to find yourself having trouble with one host, only to have EMC or Cisco support point the finger at a completely unrelated host and tell you, “That’s the offender, kill it dead.”
Symptoms
- Random SCSI Aborts on hosts that are doing relatively low IO.
When a slow-drain device is affecting your fabric, IO simply isn’t moving efficiently across it. In bigger environments, like a core-edge design, you’ll see random weirdness on a completely unrelated server, on a completely unrelated switch. In that situation, the slow-drain device causes traffic to back up to (and beyond) the ISL, and other IO gets held because the ISL can’t move the data off the core switch. So a host attached to Switch1 can effectively block traffic from moving between Switch2 and Switch3 (because Switch2, being the core switch, is now ALSO out of B2B credits).
The default timeout value for waiting on B2B credits is 500ms. After that, the switch will drop the frames, forcing the storage to re-send. If the host doesn’t receive the requested data within the HBA’s configured timeout value, the host will send a SCSI abort to the array (you’ll see this in any trace you pull).
Now the array will respond to the host’s ABTS and resend the frames that were lost. Here’s the kicker: if the array’s response gets stuck in the same congestion, the host will try to abort again, forcing the array to RESEND the whole thing one more time.
After a pre-configured number of abort attempts, the host gives up and flaps the link.
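To make that cascade concrete, here’s a rough and entirely hypothetical timeline. Only the 500ms congestion-drop default comes from above; the 30-second SCSI timeout and the three-abort limit are placeholder values, since both depend on the HBA driver and OS settings:

```python
# Hypothetical worst-case timeline for one starved exchange.
congestion_drop_ms = 500  # switch drops frames it has held this long (default)
scsi_timeout_s = 30       # ASSUMED host-side SCSI command timeout
abort_limit = 3           # ASSUMED driver abort limit before the link flaps

elapsed = 0.0
for attempt in range(1, abort_limit + 1):
    # Somewhere inside this window the switch times out (~500ms) and drops
    # the held frames, but the host only notices when its own SCSI timer
    # expires and it fires an ABTS at the array, which then resends.
    elapsed += scsi_timeout_s
    print(f"abort #{attempt}: ~{elapsed:.0f}s and still no data delivered")

print("after the final abort the host gives up and flaps the link")
```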
- Poor performance in otherwise well-performing hosts.
The hardest part about this one is that the host’s iostat will say it’s waiting for disk, and the array will show the usual 5-10ms response times, but there really is no good way of measuring how long it takes data to move from one end of the fabric to the other.
I had a colleague who used to swear that (IOStat wait-time – Array Wait Time) = SAN wait time.
The problem with that theory is that there are so many things that happen between the host pulling IO off the fabric and that IO being “ready” to the OS (read/write queues at the driver level come to mind).
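For what it’s worth, here’s that heuristic written out, with the catch spelled out in the comments; the numbers are made up purely for illustration:

```python
# My colleague's back-of-the-napkin formula, with invented numbers.
iostat_await_ms = 25.0    # what the host reports (hypothetical)
array_response_ms = 8.0   # what the array reports (hypothetical)

naive_san_wait_ms = iostat_await_ms - array_response_ms
print(f"'SAN wait' by the naive formula: {naive_san_wait_ms:.1f}ms")

# The catch: that leftover 17ms also contains everything between the HBA
# and the filesystem -- driver and HBA queueing, retries the driver never
# reports upward, interrupt handling -- so at best it's an upper bound on
# fabric latency, not a measurement of it.
```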
There are a few rather creative ways to mitigate a slow-drain device…
- You can hard-set the host ports to a lower speed.
Well, OK, I’m lying. This does the opposite of fixing the problem; it masks the issue. Hard-setting a host down to, say, 2G doesn’t prevent the slow drain…what it DOES do is prevent the host from requesting data as quickly (or, for that matter, as often). Did this, saw it work, even though every ounce of logic I’ve got says it shouldn’t. (It should, by all measures, make the issue much worse by preventing the host from taking data off the SAN at a higher rate.)
- You can set the speed of the storage ports down.
Yes, realistically, this will work. If you absolutely have to, this will help. By reducing the ratio of storage bandwidth to host bandwidth from 4:1 to 2:1, you prevent the storage from putting as much data on the network at any given time. That prevents the back-up and should keep B2B credits from running out. However, there is a simpler fix, and that is…
- 1:1 Zoning
It’s been EMC’s best practice for ages: single initiator, single target. While locking down the storage ports will work and will alleviate the bandwidth problem, simplifying your zoning will do the same job, with the added bonus of being easier to manage. The only downside is that some people like for the host not to lose ½ of its bandwidth when a director fails. (In the case of 1:2 zoning, you lose ¼ of your bandwidth when a director fails, not ½.)
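Here’s a quick sketch of how those last two options change the numbers, using the same hypothetical 2x 4G-HBA host and 8G FAs from the top of the post:

```python
# Oversubscription for the three configurations discussed above.
# Illustrative arithmetic only, based on the example host (2x 4G HBAs).
host_bw = 2 * 4                                   # 8G the host can drain

configs = {
    "1:2 zoning, 8G FAs (original)": 4 * 8,       # 4 FAs x 8G = 32G
    "1:2 zoning, FAs forced to 4G":  4 * 4,       # 4 FAs x 4G = 16G
    "1:1 zoning, 8G FAs":            2 * 8,       # 2 FAs x 8G = 16G
}

for name, storage_bw in configs.items():
    print(f"{name}: {storage_bw}G vs {host_bw}G -> {storage_bw // host_bw}:1")
# Either change gets you from 4:1 down to 2:1; the real difference is the
# path count you keep when a director fails (the 1/2 vs 1/4 trade-off above).
```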
- Reduce the queue depth on the server
Yes, it will work. Going from the EMC-recommended 32 down to 16, or even 8, restricts the number of IOs the host can have out on the fabric at any given time. This will reduce congestion…
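To put rough numbers on what that buys you, here’s how queue depth scales the amount of data a single host can have in flight; the 256K IO size is the one from the replication example earlier, and the LUN count is a made-up assumption:

```python
# In-flight data per host at different queue depths (rough illustration).
io_size_kb = 256   # from the 256K replication example earlier
lun_count = 10     # ASSUMED number of LUNs, purely for illustration

for queue_depth in (32, 16, 8):
    in_flight_mb = queue_depth * lun_count * io_size_kb / 1024
    print(f"queue depth {queue_depth:2d}: up to ~{in_flight_mb:.0f} MB outstanding, "
          f"all of which wants to land on the fabric at once")
```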
And lastly, my favorite:
- Implement QoS on the array.
EMC supports QoS on the VMAX arrays out of the box. So if you can, limit each host to the bandwidth that its HBAs are capable of. (If you have multiple arrays, you’ll have to do some clever math to figure out what the best set-point is.) This lets you keep the 1:2 zoning (2 FAs for each HBA) and prevents the slow-drain device from affecting your whole environment.
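Here’s one way that “clever math” might look when a host pulls from more than one array. This is my own sketch of the idea, not an EMC formula, and the array names and weights are invented: split the host’s HBA bandwidth across the arrays in proportion to how much each one serves it, so the caps can never add up to more than the HBAs can drain.

```python
# One possible way to pick per-array QoS caps for a multi-array host.
# My own sketch, not an EMC formula; array names and weights are invented.
hba_count = 2
hba_gbps = 4
host_limit_gbps = hba_count * hba_gbps             # 8G: all this host can drain

array_weights = {"VMAX-1": 0.75, "VMAX-2": 0.25}   # hypothetical workload split

total = sum(array_weights.values())
for array, weight in array_weights.items():
    cap_gbps = host_limit_gbps * weight / total
    cap_mbs = cap_gbps * 100   # rough rule of thumb: ~100 MB/s per 1G of FC
    print(f"{array}: cap this host at ~{cap_gbps:.1f}G (~{cap_mbs:.0f} MB/s)")
# The caps sum to 8G, so the arrays can never push more data at the host
# than its HBAs can physically take off the fabric.
```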
- Set the “NO CREDIT DROP TIMEOUT” to 100ms on each host-edge switch
This one is dangerous: it causes the switch to drop frames much sooner when it runs out of buffer credits… In theory, that has the upside of forcing a slow-drain device to fall on its face BEFORE it can affect other hosts… But remember, the other hosts are experiencing the same types of timeouts, so their frames will get dropped too.
There’s a great article on Cisco.com about what this is, in much more detail than I could hope to get into here, in case you need to sleep at night.
By the way, it’s good to be back.
4 comments
It’s funny that we ignored B2B in a product for so long (due to Cisco/Brocade differences in reporting), but once we started collecting it, we can detect a slow drain and find the 5 most likely HBAs causing it in literally 10 seconds.
I would keep your focus on B2B, but also, if you can see inside the FC frames, look for an ABTS on the link rather than at the host. You’re absolutely correct that the host’s reported metrics involve many more things than just SAN latency. Recall, too, that some drivers will auto-retry an aborted transaction but not tell the host. This can be seen on the actual FC link if you can see it. A JDSU Xgig is fine if you’re fast with trace view, but you probably want to be alerted to it, not have to connect an Xgig and wait for it to recur. It’ll happen on another link, and then we begin the whack-a-mole game. At $250k, a fully-loaded Xgig becomes expensive to add to everything.
Also watch queue depth: if you could skim around your SAN, you’d see that too many hosts with high queue depth and high exchange size can choke your FAs in a heartbeat (literally in milliseconds). I would add to your bag of tricks the habit of checking that new hosts always have a proper host-side queue depth set.
Author
We built a script (with Cisco’s help, of course) that looked for ingress queuing on the Cisco ports. It helped greatly with identification, because the problems we were seeing were on the ISLs, not on the host ports. The theory, again as explained by Cisco, was that the host was unable to take the data off the switch fast enough, causing it to queue up at the switch; the overflow would then spill over to the core switches, causing chaos around the environment. Ironic that the “ideal” core-edge topology would make troubleshooting this particular problem harder… Once the credit drain hit the core switches, it affected EVERYTHING.
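For anyone curious, a heavily simplified sketch of that kind of check might look like the following. It assumes you’ve already saved “show interface counters detailed” output from each switch to a text file; the “timeout discards” counter name and the layout are my assumptions about the output format, so adjust to what your NX-OS version actually prints:

```python
# Simplified sketch: flag ports whose saved counter output shows
# non-zero timeout discards (frames dropped after waiting on credits).
# The file format assumed here is a guess; adapt it to your switch output.
import re
import sys

suspects = []
interface = None
for line in open(sys.argv[1]):
    header = re.match(r"^(fc\d+/\d+|port-channel\d+)", line.strip())
    if header:
        interface = header.group(1)
    counter = re.search(r"(\d+)\s+timeout discards", line)
    if counter and interface and int(counter.group(1)) > 0:
        suspects.append((interface, int(counter.group(1))))

# Ports (especially ISLs) with non-zero timeout discards are where frames
# sat waiting for B2B credits until the switch gave up on them.
for port, count in sorted(suspects, key=lambda s: -s[1]):
    print(f"{port}: {count} timeout discards")
```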
FWIW, I’ve seen reports of great success with QoS on the VMAX, simply because checking queue depth is different from enforcing it on the host side. I meant to add to my last comment that QoS on the array has been a great way for the SAN admin to set a policy and enforce adherence. Rather than have the VMAX go write-pending, the guilty party tends to start complaining about slow throughput, which is a great time for the SAN admin to remind the caller about the throttling he has so far ignored. Done correctly, it’s a great way to force a conversation that otherwise gets avoided. Again, done correctly, it can keep an ally who understands, rather than turn into a we-vs-they situation…all the while keeping the SAN performing very well.
If only every storage array had that option.
If only that option was in frames/time, not MB/time.
Author
We’ve found that the default setting of 32 works fine for EMC storage, but we have to drop down to 16 on 3Par storage. mpathd isn’t “intelligent” about its multipathing; it just throws IO at whichever path is “next,” sometimes causing traffic to pile up on one link while the others sit idle.
Thanks for your comment. I didn’t really think anyone browsed this anymore; I figured it was more of my own private diary…
Thanks. 🙂