Slowly Draining Away…

Plumbers define a slow drain as one down which your teenager has tried to wash a clump of hair vaguely resembling a tribble.

Ed Mazurek from Cisco TAC defines it quite differently:

…When there are slow devices attached to the fabric, the end devices do not accept the frames at the configured or negotiated rate. These slow devices, referred to as “slow drain” devices, lead to ISL credit shortage in the traffic destined for these devices and they can congest significant parts of the fabric. 

Having fought this fight (and since I’m *STILL* fighting it), I can say quite simply that when you have a slow-drain device on your fabric, you have a potential land mine…

My (Non TAC) take is this:

Say I have the following configuration:

HostA is a blade server with 2x 4G HBAs
StorageA is an EMC VMAX with 8G FA ports.
HostA is zoned to StorageA at a 2:1 ratio (2 FAs for each HBA, so 4 FAs in total).
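
A quick back-of-the-envelope check of that configuration, as a minimal sketch. It assumes each HBA is zoned to its own pair of FA ports and uses nominal link speeds only; the function name is mine, not anything from EMC or Cisco.

```python
# Nominal link speeds in Gb/s, taken from the example configuration above.
HBA_SPEED_GB = 4   # each host HBA
FA_SPEED_GB = 8    # each VMAX FA port
HBAS = 2           # HBAs in HostA
FAS_PER_HBA = 2    # zoning ratio: 2 FAs for each HBA


def oversubscription(hbas, hba_speed, fas_per_hba, fa_speed):
    """Return (storage_gb, host_gb, ratio) for a simple zoning layout.

    Assumes every HBA is zoned to distinct FA ports (no shared FAs).
    """
    host_gb = hbas * hba_speed
    storage_gb = hbas * fas_per_hba * fa_speed
    return storage_gb, host_gb, storage_gb / host_gb


storage_gb, host_gb, ratio = oversubscription(
    HBAS, HBA_SPEED_GB, FAS_PER_HBA, FA_SPEED_GB
)
# Prints: 32G of storage vs 8G of host = 4:1
print(f"{storage_gb}G of storage vs {host_gb}G of host = {ratio:.0f}:1")
```

So the array can, on paper, push four times what the host can swallow.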

Now, say HostA requests data from StorageA (for instance, say you have an evil software/host-based replication package, which shall remain nameless, that likes to do reads and writes in 256K blocks).  StorageA is going to assemble and transmit the data as fast as it can.  *IF* you have 32G of storage bandwidth staring down the barrel of 8G of host bandwidth, the host might not be able to accept the data as fast as the storage is sending it.  The host has to tell the switch that it’s ready to receive more data (Receiver Ready, or “R_RDY”), and each R_RDY hands one buffer credit back to the switch.  The switch tries to hold on to the data as long as it can, but there are only so many buffer credits on a switch.  If the switch runs out of buffer credits, this can affect other hosts on that switch or, if it’s bad enough, hosts across the fabric.
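
To make the credit mechanics a little more concrete, here is a toy model of one switch-to-host port. It is only a sketch: the credit count, the per-tick frame rates, and the function itself are made-up illustration values, not how any real switch ASIC accounts for credits.

```python
def simulate_credits(credits, frames_in_per_tick, frames_drained_per_tick, ticks):
    """Toy model of buffer-to-buffer credits on one switch-to-host port.

    Each tick, `frames_in_per_tick` frames arrive at the switch for the host.
    The switch may only forward a frame while it holds a credit. The host
    returns one credit (R_RDY) for each frame it manages to drain.
    Returns the number of frames stuck in the switch after each tick.
    """
    available = credits
    backlog = 0
    history = []
    for _ in range(ticks):
        backlog += frames_in_per_tick
        sent = min(backlog, available)   # cannot send without a credit
        backlog -= sent
        available -= sent
        outstanding = credits - available          # frames the host still holds
        returned = min(outstanding, frames_drained_per_tick)
        available += returned                      # R_RDYs coming back
        history.append(backlog)
    return history


# Host keeps up: nothing ever queues in the switch.
print(simulate_credits(16, 4, 4, 8))   # [0, 0, 0, 0, 0, 0, 0, 0]
# Host drains 1 frame per tick while 4 arrive: credits run dry, backlog grows.
print(simulate_credits(16, 4, 1, 8))   # [0, 0, 0, 0, 0, 3, 6, 9]
```

Note how the slow host looks perfectly fine for the first few ticks. The credits hide the problem right up until they are gone, and after that the backlog only grows.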

So it’s possible to find yourself having trouble with one host, only to have EMC or Cisco support point the finger at a completely unrelated host and tell you, “That’s the offender, kill it dead.”

Symptoms

  • Random SCSI Aborts on hosts that are doing relatively low IO.

When a slow drain is affecting your fabric, IO simply isn’t moving efficiently across it.  In bigger environments, like a Core-Edge design, you’ll see random weirdness on a completely unrelated server, on a completely unrelated switch.  In that situation the slow-drain device is causing traffic to back up to (and beyond) the ISL, and other IO gets held because the ISL can’t move the data off the core switch.  So a host attached to Switch1 can effectively block traffic from moving between Switch2 and Switch3 (because Switch2, being the core switch, is now ALSO out of B2B credits).

The default timeout value for waiting for B2B credits is 500ms.  After that, the switch will drop the frames, forcing the storage to re-send.  If the host doesn’t receive the requested data within the HBA’s configured timeout value, the host will send a SCSI abort (ABTS) to the array (you’ll see this in any trace you pull).

Now the array will respond to the host’s ABTS and resend the frames that were lost.  Here’s the kicker: if the array’s response gets caught up in the same congestion, the host will try to abort again, forcing the array to RESEND the whole thing one more time.

After a pre-configured # of abort attempts, the host gives up and flaps the link.

  • Poor performance in otherwise well-performing hosts.

The hardest part about this one is that the host’s iostat will say it’s waiting on disk, and the array will show the usual 5-10ms response times, but there really is no good way of measuring how long it takes data to move from one end of the fabric to the other.

I had a colleague who used to swear that (iostat wait time – array wait time) = SAN wait time.

The problem with that theory is that there are so many things that happen between the host pulling IO off the fabric and that IO being “ready” to the OS.  (Read/write queues at the driver level come to mind.)
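
If you want to run my colleague’s subtraction anyway, a minimal sketch looks like the following. The sample numbers are hypothetical, not measurements, and the result is best treated as an upper bound on fabric time, since it also contains every host-side queue mentioned above.

```python
def unexplained_latency_ms(host_await_ms, array_response_ms):
    """Host-observed latency minus array-reported latency, floored at zero.

    The result is everything the array did not account for: fabric time,
    but also HBA and driver queues, multipathing, and the OS block layer.
    """
    return max(host_await_ms - array_response_ms, 0.0)


# Hypothetical sample values, not taken from any real host or array.
residual = unexplained_latency_ms(host_await_ms=42.0, array_response_ms=8.0)
print(f"{residual:.1f} ms unaccounted for (an upper bound on SAN time, not a measurement)")
```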


There are a few, rather creative ways to mitigate a slow-drain…

  • You can hard-set the host ports to a lower speed.

Well ok, I’m lying.  This does the opposite of fixing the problem: it masks the issue.  Hard-setting a host down to, say, 2Gb doesn’t prevent the slow drain… what it DOES do is prevent the host from requesting data as quickly (or, for that matter, as often).  Did this, saw it work, even though every ounce of logic I’ve got says it shouldn’t.  (By all measures it should make the issue much worse, since it keeps the host from taking data off the SAN at a higher rate.)

  • You can set the speed of the storage ports down.

Yes, realistically, this will work.  If you absolutely have to, this will help.  By reducing the ratio of storage bandwidth to host bandwidth from 4:1 to 2:1, you are preventing the storage from putting as much data on the network at any given time.  This prevents the back-up and should keep B2B credits from running out.  However, there is a simpler fix, and that is…

  • 1:1 Zoning

It’s been EMC’s best practice for ages: single-initiator, single-target.  While locking down the storage ports will work and will alleviate the bandwidth problem, simplifying your zoning will do the same job, with the added bonus of being easier to manage.  The only downside is that some people would rather the host not lose ½ of its bandwidth when a director fails.  (With 1:2 zoning, you lose ¼ of your bandwidth when a director fails, not ½.)
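
The ½ versus ¼ numbers fall out of simple path counting. A small sketch, assuming a two-HBA host, every zoned FA port living on a different director, and load spread evenly across paths (all three are my assumptions for illustration):

```python
from fractions import Fraction


def share_lost_per_director(hbas, fas_per_hba):
    """Fraction of host paths lost when a single director fails.

    Assumes each zoned FA port sits on its own director and that all
    paths carry an equal share of the host's bandwidth.
    """
    paths = hbas * fas_per_hba
    return Fraction(1, paths)


for label, fas_per_hba in (("1:1 zoning", 1), ("1:2 zoning", 2)):
    lost = share_lost_per_director(hbas=2, fas_per_hba=fas_per_hba)
    # Prints "1:1 zoning loses 1/2 ..." then "1:2 zoning loses 1/4 ..."
    print(f"{label} loses {lost} of its bandwidth on a director failure")
```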

  • Reduce the queue depth on the server

Yes, it will work.  Going from the EMC-recommended 32 down to 16, or even 8, restricts the number of IOs the host can have out on the fabric at any given time.  This will reduce congestion…
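
To see why this helps, it is worth working out how much data a host can have outstanding at once. The sketch below assumes the queue depth applies per LUN, reuses the 256K block size from the replication example, and picks 4 LUNs purely for illustration. How queue depth is actually scoped depends on your HBA driver.

```python
IO_SIZE_KIB = 256  # block size from the replication example earlier


def max_inflight_mib(queue_depth, luns, io_size_kib=IO_SIZE_KIB):
    """Worst-case data a host can have outstanding on the fabric, in MiB.

    Assumes the queue depth is enforced per LUN and every slot holds a
    full-size IO.
    """
    return queue_depth * luns * io_size_kib / 1024


for qd in (32, 16, 8):
    # With 4 LUNs this prints 32.0, 16.0 and 8.0 MiB respectively.
    print(f"queue depth {qd:>2}: up to {max_inflight_mib(qd, luns=4)} MiB in flight")
```

Halving the queue depth halves the worst-case pile of data the array can be asked to throw at the host at any one moment.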

And lastly, my favorite:

  • Implement QoS on the array.

EMC supports QoS on the VMAX arrays out of the box.  So if you can, limit each host to the bandwidth that the host’s HBAs are capable of.  (If you have multiple arrays, you’ll have to do some clever math to figure out what the best set-point is.)  This allows you to continue to use the 1:2 zoning (2 FAs for each HBA) and prevents the slow-drain device from affecting your whole environment.
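
The “clever math” can start out very simple. Here is a rough sketch that turns HBA speeds into a per-array bandwidth cap using nominal Fibre Channel throughput figures. The even split across arrays is my assumption, and the function is illustrative only; it is not a VMAX command or API.

```python
# Nominal Fibre Channel throughput per direction, in MB/s, keyed by link speed in Gb/s.
FC_MBPS = {1: 100, 2: 200, 4: 400, 8: 800}


def qos_limit_mbps(hba_speeds_gb, arrays=1):
    """Suggested per-array bandwidth cap for one host, in MB/s.

    Sums the nominal throughput of the host's HBAs and splits it evenly
    across the arrays the host talks to (an assumption; skew it if one
    array carries most of the load).
    """
    host_mbps = sum(FC_MBPS[speed] for speed in hba_speeds_gb)
    return host_mbps / arrays


print(qos_limit_mbps([4, 4]))            # HostA on one array: 800.0 MB/s
print(qos_limit_mbps([4, 4], arrays=2))  # HostA across two arrays: 400.0 MB/s each
```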

  • Set the “NO CREDIT DROP TIMEOUT” to 100ms on each host-edge switch

This one is dangerous: doing this causes the switch to drop frames much sooner when there are no buffer credits…  This has the upside of forcing a slow-drain device to fall on its face BEFORE it can affect other hosts, in theory…  But remember that the other hosts are experiencing the same types of timeouts, so they’ll get dropped too.

There’s a great article on Cisco.com that covers what slow drain is in much more detail than I could hope to get into here, in case you need to sleep at night.

Cisco Slow-Drain Whitepaper

By the way, it’s good to be back.
