Losing the cloud… (Part 1)

There’s a Dilbert comic strip that I found hilarious a while back…

The hilarious part is that the chance of this happening in real life is non-zero.  Not that it is likely to happen, but it’s impossible, statistically speaking, to completely rule out the idea.

Now there are “big” cloud providers like AWS or…well…AWS.  The chances of your datacenter getting lost there are lower; they’re not going to disappear, and they’re a pretty together company, so the odds are in your favor.

But what if it were to happen?

Say I’m a small business (I am, actually) and because I’m cheap, I want to outsource all of my datacenter operations, email, etc., to “Bob’s Clouds and Stuff.”  Email, Database, Custom Widget Application, all of it.

The migration is easy: virtualize my systems and upload them, right?  (Or the smarter way is to create new ones and migrate to them, but that’s a different story.)

But what if Bob decides that he’s done, that he’s going to shut everything down and run to Aruba because his ex-wife is after him for 10 years of back child support?  Or comes down with a rash no one can identify and dies?

Ok, a little far-fetched, but you get the drift.  What’s a small business’s recourse if its cloud provider just folds?  Do you have any?  Can you pay the lawyers to fight out who owns what while you’re not making any money because your entire operation has been “turned off”?

It’s a horrifically overstated problem, but it brings out the potential downside to cloud computing.  You don’t actually have control.  You are putting your data, your livelihood, your company’s very being, in the hands of someone else who may or may not care.

I’m a control freak.  Anyone who knows me or has tried unsuccessfully to have me committed in the past 20 years knows that.

I want control of my data.  I want it in my hot little hands.  I want to have tapes.  I want to know where they are and I want to have instant access to them at 2am if I wake up and find I’ve had a nightmare about all of my data being gone.

 

Competitive Marketing advice…

Last week I had to sit through one of those “competitive sales pitch” meetings.  You know, where Company A compares their product to Company B and, of course, tries to make you draw the conclusion that Company A’s product is light-years ahead of the competition, even if it isn’t.

Now I’m under NDA, so I can’t disclose the brands, or in fact anything about the specs involved, but I can speak to the tone of the meeting.

It was mean, and spiteful, and nasty, and it put me off Company A’s product entirely.  (Needless to say, we’re not buying any.)

Listen.  I know every hardware vendor thinks their product is the best thing since sliced bread (and really, what isn’t, right?).  But if you’re going to do a comparison, make it about how great your product is, not how lousy your competitor’s is.  When you do that, you come off as petty, and bitter, and spiteful, and not very believable.

Show me the numbers.  And not the marketing numbers, the real numbers.  You say your array can do 1.5 million IOPS?  Show me the breakdown.  You say your switch can do sub-microsecond switching?  Don’t forget to clarify that that’s only to adjacent ports.  You say your backup software can back up a multi-terabyte system?  Show me that it can restore it as well.

And don’t show me slides with pictures of your parts and talk to me about how much better looking, prettier, or better laid out your hardware is.  It means nothing.  Functionality is everything.  Yes, you’ve combined multiple redundant components into one chip, but now, if that one chip fails, you lose 8x the functionality.  (I.e., the only thing you’ve taken out of the system is the redundancy.)

I’m a big proponent of “you get what you pay for,” especially in enterprise systems.  You show me a vendor who is selling their hardware for 10% of what another “comparable” vendor is, and my first question is “What’s missing?”

That’s all.

</rant>

Cutting the ends off the roast…

When I used to teach, I always told the story of making the roast. It’s a parable, but it works.

As follows:

I was making a roast one day, and I cut the ends off it before I put it in the pan. My kid asked, “Why did you cut the ends off the roast?”

“Because that’s how my mom did it.”

Curiosity got the better of me and I asked my mom “Mom, why do you cut the ends off a roast when you make it?”

“Because that’s how Grandma did it.”

Again, curiosity: I call my grandmother and ask HER, “Grandma, why do you cut the ends off the roast?”

“Oh, well my pan is too short.”

<head meets desk>

There is an inherent danger in doing things the way they’ve always been done without giving thought to why. Situations change, technology evolves, and suddenly the “way you’ve always done it” becomes the most inefficient way possible because some new method has come along, or even worse, becomes the WRONG way to do something because the underlying technology has changed.

“Hard” vs. “soft” zoning comes to mind.  No one in their right mind does hard zoning anymore…most vendors discourage it, and a few won’t even support it.

But 15 years ago, it was best practice.  Things change, technology changes, so people MUST change along with it.

Slowly Draining Away…

Plumbers define a slow drain as one down which your teenage kid has tried to wash a clump of hair vaguely resembling a tribble.

Ed Mazurek from Cisco TAC defines it quite differently:

…When there are slow devices attached to the fabric, the end devices do not accept the frames at the configured or negotiated rate. These slow devices, referred to as “slow drain” devices, lead to ISL credit shortage in the traffic destined for these devices and they can congest significant parts of the fabric. 

Having fought this fight (and being that I’m *STILL* fighting this fight), I can say quite simply that when you have a slow-drain device on your fabric, you have a potential land mine…

My (Non TAC) take is this:

Say I have the following configuration:

HostA is a blade server with 2x 4G HBAs
StorageA is an EMC VMAX with 8G FA ports.
HostA is zoned to StorageA with a 2:1 ratio.  (1 HBA to 2 FAs)

Now when HostA requests storage from StorageA (for instance, say you have an evil software/host-based replication package that shall remain nameless that likes to do reads and writes in 256K blocks), StorageA is going to assemble and transmit the data as fast as it can.  *IF* you have 32G of storage bandwidth staring down the barrel of 8G of host bandwidth, the host might not be able to accept the data as fast as the storage is sending it.  The host has to tell the switch that it’s ready to receive data (Receiver Ready, or “R_RDY”).  The switch tries to hold on to the data as long as it can, but there are only so many buffer credits on a switch.  If the switch runs out of buffer credits, this can affect other hosts on that switch, or if it’s bad enough, even hosts across the fabric.
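
To put rough numbers on that mismatch, here’s a back-of-the-envelope sketch of the arithmetic (mine, not anything from EMC or Cisco), using the HostA/StorageA configuration above:

    # Back-of-the-envelope look at the HostA/StorageA mismatch described above.
    # Port counts and speeds come from the example configuration; nothing here
    # is measured, it's just the arithmetic.

    HBA_COUNT = 2          # HostA has two HBAs
    HBA_GBPS = 4           # each HBA negotiates 4G
    FAS_PER_HBA = 2        # each HBA is zoned to two FA ports
    FA_GBPS = 8            # VMAX FA ports running at 8G

    host_bw = HBA_COUNT * HBA_GBPS                     # 8G the host can absorb
    storage_bw = HBA_COUNT * FAS_PER_HBA * FA_GBPS     # 32G the array can push

    print(f"host bandwidth:    {host_bw}G")
    print(f"storage bandwidth: {storage_bw}G")
    print(f"oversubscription:  {storage_bw // host_bw}:1 toward the host")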

So it’s possible to find yourself having trouble with one host, only to have EMC or Cisco support point the finger at a completely unrelated host and tell you, “That’s the offender, kill it dead.”

Symptoms

  • Random SCSI Aborts on hosts that are doing relatively low IO.

When a slow drain is affecting your fabric, IO simply isn’t moving efficiently across it.  In bigger environments, like a core-edge environment, you’ll see random weirdness on a completely unrelated server, on a completely unrelated switch.  The slow-drain device is, in that situation, causing traffic to back up to (and beyond) the ISL, and is causing other IO to get held because the ISL can’t move the data off the core switch.  So in that case, a host attached to Switch1 can effectively block traffic from moving between Switch2 and Switch3.  (Because Switch2, being the core switch, is now ALSO out of B2B credits.)

The default timeout value for waiting for B2B credits is 500ms.  After that, the switch will drop the frames, forcing the storage to re-send.  If the host doesn’t receive the requested data within the HBA-configured timeout value, the host will send a SCSI abort to the array (you’ll see this in any trace you pull).

Now the array will respond to the host’s ABTS and resend the frames that were lost.  Here’s the kicker: if the array’s response gets caught up in the same congestion, the host will try to abort again, forcing the array to RESEND the whole thing one more time.

After a pre-configured # of abort attempts, the host gives up and flaps the link.
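
To see that spiral written down, here’s a toy sketch of the sequence (my own simplification, not real driver code; the 500ms figure is the default mentioned above, and the abort budget is an assumed value):

    # Toy model of the drop/abort/flap spiral described above. Not driver code;
    # just the sequence of events with illustrative numbers.

    CREDIT_DROP_TIMEOUT_MS = 500   # switch drops frames held this long without credits
    MAX_ABORTS = 3                 # assumed abort budget before the host flaps the link

    def read_on_congested_fabric(frame_delivery_ms: float) -> str:
        """Walk one read through the fabric and report how it ends."""
        aborts = 0
        while aborts < MAX_ABORTS:
            if frame_delivery_ms <= CREDIT_DROP_TIMEOUT_MS:
                return "completed"        # frames arrived before the switch gave up on them
            # Switch timed the frames out; the host's own IO timeout expires,
            # it sends an ABTS, and the array resends the whole transfer.
            aborts += 1
        return "link flapped"             # abort budget exhausted, host resets the link

    print(read_on_congested_fabric(frame_delivery_ms=50))    # healthy fabric: completed
    print(read_on_congested_fabric(frame_delivery_ms=900))   # congested fabric: link flapped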

  • Poor performance in otherwise well-performing hosts.

The hardest part about this one is that the host’s iostat will say it’s waiting for disk, and the array will show the usual 5-10ms response times, but there really is no good way of measuring how long it takes data to move from one end of the fabric to the other.

I had a colleague who used to swear that (IOStat wait-time – Array Wait Time) = SAN wait time.

The problem with that theory is that there are so many things that happen between the host pulling IO off the fabric and that IO being “ready” to the OS.  (Read/write queues at the driver level come to mind.)
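
For what it’s worth, the rule of thumb is trivial to write down; the numbers below are made up, and the caveat above still applies, since host-side queueing hides inside the iostat figure:

    # The colleague's rule of thumb, with the caveat that driver-level queueing
    # is buried inside the iostat figure, so treat it as a rough upper bound.

    def estimated_san_wait_ms(iostat_wait_ms: float, array_wait_ms: float) -> float:
        """Whatever the array doesn't account for gets blamed on 'the SAN'."""
        return max(iostat_wait_ms - array_wait_ms, 0.0)

    print(estimated_san_wait_ms(iostat_wait_ms=45.0, array_wait_ms=8.0))   # 37.0 ms of "somewhere else"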


There are a few rather creative ways to mitigate a slow drain…

  • You can hard-set the host ports to a lower speed.

Well, ok, I’m lying.  This does the opposite of fixing the problem; it masks the issue.  Hard-setting a host down to, say, 2G doesn’t prevent the slow drain…what it DOES do is prevent the host from requesting data as quickly (or, for that matter, as often).  Did this, saw it work, even though every ounce of logic I’ve got says it shouldn’t.  (It should, by all measures, make the issue much worse by preventing the host from taking data off the SAN at a much higher rate.)

  • You can set the speed of the storage ports down.

Yes, realistically, this will work.  If you absolutely have to, this will help.  By reducing the ratio of storage bandwidth to host bandwidth from 4:1 to 2:1, you are preventing the storage from putting as much data on the network at any given time.  This prevents the back-up and should keep B2B credits from running out.  However, there is a simpler option, and that is…

  • 1:1 Zoning

It’s been EMC’s best practice for ages: single initiator, single target.  While locking down the storage ports will work, and will alleviate the bandwidth problem, simplifying your zoning will do the same job, with the added bonus of being easier to manage.  The only downside is that some people like the host not to lose ½ of its bandwidth when a director fails.  (In the case of 1:2 zoning, you lose ¼ of your bandwidth when a director fails, not ½.)
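
The ½ vs. ¼ point is just path counting.  A quick sketch, under the assumption that the host’s FA ports are spread so a single failed director takes out exactly one of them:

    # Path counting behind the 1:1 vs. 1:2 zoning trade-off above. Assumes each
    # of the host's FA ports sits on a different director and all ports run at
    # the same speed, so losing one director means losing exactly one path.

    def bandwidth_lost(hbas: int, fas_per_hba: int, paths_lost: int = 1) -> float:
        """Fraction of the host's storage bandwidth gone after a director failure."""
        total_paths = hbas * fas_per_hba
        return paths_lost / total_paths

    print(bandwidth_lost(hbas=2, fas_per_hba=1))   # 1:1 zoning -> 0.5 of the bandwidth gone
    print(bandwidth_lost(hbas=2, fas_per_hba=2))   # 1:2 zoning -> 0.25 gone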

  • Reduce the queue depth on the server

Yes, it will work.  Going from the EMC-recommended 32 down to 16, or even 8, restricts the number of IOs the host can have out on the fabric at any given time.  This will reduce congestion…
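
Roughly speaking, outstanding IO is what ends up sitting in switch buffers, so queue depth puts a ceiling on how much data one host can have in flight.  A rough sketch, treating queue depth as per-HBA and borrowing the 256K block size from the example above (a simplification, since real queue depths are usually set per LUN or per port):

    # Ceiling on in-flight data for one host at a given queue depth. Treats the
    # queue depth as per HBA and uses the 256K block size from the earlier
    # example; real settings are usually per LUN or per port, so this is rough.

    BLOCK_KB = 256

    def max_in_flight_mb(queue_depth: int, hbas: int = 2, block_kb: int = BLOCK_KB) -> float:
        """Worst-case data one host can have outstanding on the fabric."""
        return queue_depth * hbas * block_kb / 1024

    for qd in (32, 16, 8):
        print(f"queue depth {qd:>2}: up to {max_in_flight_mb(qd):.0f} MB in flight")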

And lastly, my favorite:

  • Implement QoS on the array.

EMC supports QoS on the VMAX arrays out of the box.  So if you can, limit each host to the bandwidth that the host’s HBAs are capable of.  (If you have multiple arrays, you’ll have to do some clever math to figure out what the best set-point is.)  This allows you to continue to use the 1:2 zoning (2 FAs for each HBA) and prevents the slow-drain device from affecting your whole environment.
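
As for the “clever math,” here’s one way I’d approach it (my own rule of thumb, not an EMC-published formula): cap what each array can push at the host’s usable HBA bandwidth, split across however many arrays feed that host.

    # One possible set-point calculation (an assumption, not EMC guidance): no
    # single array gets to push more than the host's HBAs can absorb, split
    # across however many arrays present storage to that host.

    def qos_cap_mb_per_s(hba_count: int, hba_gbps: int, arrays_feeding_host: int = 1) -> float:
        """Per-array bandwidth ceiling for one host, in MB/s (roughly 100 MB/s per Gbit of FC)."""
        host_mb_per_s = hba_count * hba_gbps * 100
        return host_mb_per_s / arrays_feeding_host

    print(qos_cap_mb_per_s(hba_count=2, hba_gbps=4))                          # one array: ~800 MB/s
    print(qos_cap_mb_per_s(hba_count=2, hba_gbps=4, arrays_feeding_host=2))   # two arrays: ~400 MB/s each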

  • Set the “NO CREDIT DROP TIMEOUT” to 100ms on each host-edge switch

This one is dangerous.  Doing this causes the switch to drop frames much sooner when there are no buffer credits…  This has the upside of forcing a slow-drain device to fall on its face BEFORE it can affect other hosts, in theory…  But remember that the other hosts are experiencing the same types of timeouts, so they’ll get dropped too.

There’s a great article on Cisco.com about what this is, in much more detail than I could hope to get into here, in case you need help sleeping at night.

Cisco Slow-Drain Whitepaper

By the way, it’s good to be back.

Recovering from a windows AD failure…

A couple of years ago my PDC died.  The only physical box in my environment, and it was the one server that died.

I was 2,700 miles away.  I wasn’t going to be back any time soon, and stuff was broken.  (Thankfully, customer data was on the Linux webhosting environment, so nothing was lost there, except their backups.)

My setup involves 1 physical server and about 14 VMs (on two physical hosts).  The physical server does a number of things.  In addition to being the PDC/Infrastructure Master, etc., it holds my backups, gives me a place to run consoles for various management agents…etc.

It died.  It rebooted after a power failure in the hosted datacenter I was throwing good money away on.  (Don’t EVEN get me started.)

Anyway, technical mumbo-jumbo.

I recovered the original DC as a domain member using the following steps:

1. On DC1, disconnected the network connection and booted the host. *VERY* important…

2. On DC1, force-removed the secondary/tertiary Active Directory servers (DC3, DC4).

3. On DC1, ran DCPROMO and removed Active Directory.  (There were a couple of minor gotchas doing this; like an idiot I didn’t write them down, but they were easy fixes, easily googleable.  Is too a word.)  This removes all AD membership and makes DC1 a standalone workstation.

4. Shut down DC1

5. On the new PDC (DC2), removed DC1 as an AD server.

6. On DC1, reconnected the network and booted the server.

7. Joined DC1 to AD as a domain member.

The quest for 100% uptime…

Are you the type of IT shop that won’t take downtime?  I mean won’t take downtime to the point that there are EOSL applications running on EOSL hardware, and redundancy is gone because an HBA has failed and a replacement simply isn’t available (or is available, but you won’t take the outage to replace it)?

It got me thinking about this quest for “100% uptime.”  Is it possible?

In my experience downtime is absolutely required.  Not only is it required, it’s guaranteed.  It’s *GOING* to happen eventually.  Whether it happens on your schedule or on the universe’s schedule is the only thing you have any control over. (and even then, sometimes not)

I’ve found in my experience that virtualization platforms, VMware, Hyper-V, or whatever else is out there, are almost a requirement if your aim is to provide for minimal hardware-related outages.  They allow you to move a guest from one system to another, replace/upgrade hardware, move it back, etc.  They also allow for a certain level of storage virtualization, letting you move off end-of-life storage and keep things “fresh.”

But then there is the operating system.  To my knowledge, *ALL* Midrange OS’s require patching, upgrades, reboots.  All of them.

When you add to that the fact that operating systems were written by human beings, and in large part by thousands of human beings, some of whom never talk to one another, well, you get the idea.  Computers are an imperfect construction of imperfect beings.

So, in short, because I know that TL;DR really is a thing these days: don’t promise anyone “100% uptime,” because if you do, you’re a liar.  It simply cannot happen.

Privacy In The Clouds….

I’m not sure why this never got discussed before, but suddenly, with the “shocking” revelation that the government has been collecting data from the cloud in bulk, the concept of “privacy” is on everyone’s mind.

I’m telling you.  Anyone who thought their “Cloud” storage was secure from prying eyes has deluded themselves with visions of puppy-dogs and unicorns.

Personally I’m not worried about it.  I never expected anything I put in the cloud to be private anyway.

Bottom line: the internet wasn’t designed to be secure, it was designed to be redundant, transparent, resilient.  When you send information out “to the cloud,” you’re trusting your electronic information to equipment that other people own and control, and as such you have no guarantee as to the security of your data.

I got into my share of “discussions” on news message boards when Edward Snowden broke the news that the NSA was spying on Americans… (Duh.)   When I told people “I don’t care” and “I assumed it was anyway,” I got lambasted.

So how *DO* you secure your data?

Endpoint Encryption

The only way to be reasonably sure that your data is secure in transit is to implement endpoint encryption: an encryption device on the source, another on the target, and, if you *REALLY* want to be secure, encryption keys that you’ve HAND CARRIED from Point A to Point B.  (Sending your private key over email is, by definition, stupid.)

Then you’re only at the mercy of Barracuda, Cisco, EMC, or whomever built your encryption appliance.  Here’s a thought, though… Do you know there is no back door to decrypt data?  How do you know?  The code that runs on these appliances is proprietary; you don’t know ANYTHING about the internal code, and I’m sure none of the above will release the source code to you for inspection (nor do you have any reasonable assurance that the source code you’re shown is what’s compiled and running on your encryption appliance).  Again, it’s a matter of trust, but there is always the possibility.
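
If you’d rather not trust an appliance at all, you can do the encryption yourself before anything leaves your hands.  Here’s a minimal sketch using the third-party Python “cryptography” package; the data is obviously a stand-in, and key management is still entirely on you:

    # Client-side encryption you control end to end: encrypt locally, keep the
    # key yourself, and only ciphertext ever goes to the cloud provider.
    # Requires: pip install cryptography

    from cryptography.fernet import Fernet

    key = Fernet.generate_key()        # whoever holds this holds your data; store it offline
    f = Fernet(key)

    ciphertext = f.encrypt(b"customer list, payroll, the works")   # what actually leaves your building
    print(ciphertext[:40], b"...")

    plaintext = f.decrypt(ciphertext)  # only possible wherever the key lives
    print(plaintext)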

Closed System

This is the only real chance for security: a campus-wide, closed system, with no external connection to the internet, optical (as opposed to copper) connections between buildings, and so on.  But is it worth it?

I had a colleague when I worked for the (thankfully defunct) student loan company who used to say that the best way to secure a system was to turn it off.  He probably wasn’t far from the truth.  When I took my Windows NT MCP certification course (dating myself, huh?), my instructor told us that Windows NT was the most secure operating system on the planet, provided the computer wasn’t connected to a network.  (Then, presumably, all bets were off.)

In conclusion: the more of your application/data you put in the “cloud,” the more vulnerable you are to plundering, not to mention outages that are completely out of your control (right, Amazon AWS?).

If you keep your data in-house, under your control, not only do you have a neck to choke when your system goes down, but you can be reasonably sure of its security.

(Unless you plug it into the internet – then all bets are off)

 

For sale:

Just wondering if there is any interest out there. I have three working storage arrays I’m looking for a new home for.

1 Clariion cx300
1 Clariion cx500
1 Celerra ns500

The ns500 can be split (I have the cables to split the back-end) but I want to sell it as one unit.

Everything works, though while I have cables, I don’t have SPS units for them.  (Those are pretty cheap/easy to find.)

The cx500 and ns500 have 146g vault sets, the cx300 has 73g vault drives.

I have a bunch of 73g and 146g drives available to go with if you’re similarly interested.

Make me an offer, but expect $150 to $200 in shipping charges.

Please know how to set them up. I’m not signing on for lifetime support.

The “Public” Cloud…

It’s really easy to point-and-click yourself to a leased cloud or public cloud infrastructure.  Throw a credit-card # in and you can start working immediately.

But it has to come down to the real question.

Should you?

Putting your application “In The Cloud” whether it be Google’s new service, Amazon, or anyone who you rent a few hours of CPU time from can be the easiest way to start something big.

But what’s the downside?  Is there one?

Anyone who knows me knows I’m a bit OCD.  I want to be in control.  When you put your application “in the cloud,” you are putting your faith in someone you don’t know, in a building you’ve never seen, using hardware you know nothing about.  Most “cloud” providers use hardware that barely qualifies as “enterprise.”

An example: the place I used to rent space from (which I will not mention right now because I don’t want a subpoena from them) used the CRAPPIEST Supermicro servers: single power supply, single hard drive, no backup infrastructure, etc.

This was their cloud.  The *STATED* logic (yes, they actually told me this) being that it’s cheaper to settle with the occasional customer over a failure than it is to buy real hardware.

Now I’m not, by any extension, saying that they are all like that.  I’m saying that if you haven’t seen the hardware, you have no idea where they’re putting you or what the reliability rate is.  You only know what their salespeople / website tell you.

Anyone who believes marketing at face value needs to talk to me about this bridge I’ve got for sale in the Everglades…

If you want to *KNOW* it’s done right, you have to do it yourself.  Bottom line.  Anything else is faith.

If you want the flexibility of a capital-C Cloud infrastructure, build one yourself.  VMWare, EMC, Cisco, or pick your brand.

Just don’t delude yourself about what you’re getting.  If you buy cheap, you get cheap.  It’s the law.

 

Overwhelmed…

To say I’m overwhelmed is an understatement.  Hip deep in a major Tech-Refresh for a (non-EMC) vendor that is sucking my life dry, I sometimes forget that this blog is ever here. (As evidenced by the lack of activity)

On the blog/webhosting front things have been interesting.  CATBytes is hosting about 50 users or so, mostly informally; just bloggers and the like looking for a cheap place to park their WordPress sites.

I guess the part of it I forgot about was security.  I am *NOT* a big security wonk, and I’m learning this stuff as I go.  One of my users used a simple password and allowed their site to be hacked, and while that SHOULDN’T have been a big deal, it let someone start sending out denial-of-service attacks using one of my webservers.

For about a month.

And it didn’t occur to me because I wasn’t getting any complaints about bandwidth, speed, etc.  (My equipment is good, my internet uplink is good, so it was hardly noticeable.)

Until the bill came.  See, I pay $38 per Mbit for a 10Mbit commit, but it’s a 100Mbit pipe.  They don’t bill me for the extra bandwidth so long as I don’t exceed my 10Mbit for more than something like 5% of the time.  And normally I don’t, by a long shot.

Except for this month.  And since the hack managed to straddle two billing cycles, It double-hit me.

Now my provider “neglected” to tell me about this overage until months later, stating that they had a glitch in their billing.  But running 90Mbit over my 10 for almost 30 solid days makes for a SEVEN THOUSAND DOLLAR bandwidth bill.
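
The math is ugly but simple (my reconstruction, assuming the overage gets billed at the same $38-per-Mbit rate as the commit):

    # Reconstruction of the bill, assuming overage is charged at the same
    # $38/Mbit rate as the commit and the attack pegged the full pipe.

    RATE_PER_MBIT = 38     # $ per Mbit per billing cycle
    COMMIT_MBIT = 10       # what I actually pay for
    PEAK_MBIT = 100        # the DoS traffic filled the 100Mbit pipe

    overage_per_cycle = (PEAK_MBIT - COMMIT_MBIT) * RATE_PER_MBIT
    print(f"one billing cycle:     ${overage_per_cycle:,}")       # $3,420
    print(f"straddling two cycles: ${2 * overage_per_cycle:,}")   # $6,840 (call it seven grand)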

Crap.  So now I’ve rapidly taught myself how to limit bandwidth in VMware (something I should have been doing the whole time), but I have a mad fight on my hands to try to get this provider to see that they’ll bankrupt me if they pursue this, and that won’t be good for either of us.

I hope they see logic.  Because if not, I have to explain to 50-75 bloggers why their sites are going down.  And I *WILL* name names.