Good Cloud, Bad Cloud, a Titanic story…

This week’s abject failure of Amazon’s EC2 hosting environment has caused quite the stir.  There are those who say that this incident “Proves Cloud Failure Recovery is a Myth” and others who say that we should just give it a chance.

Facts are facts.  Amazon screwed the pooch big-time last week.  Their outage caused ripple effects nationwide.  But while it’s easy to throw the blame at Amazon for the failure, it’s important to remember that cloud computing is still only in its infancy, and this mad rush to adopt it is part and parcel of the reason these problems are happening.  Customers rushing for a new product create demand, and companies looking to be the first to capitalize on that demand create a product that may or may not be ready for prime time.

But because no one ever thought to test the kind of cascade failure they experienced (because it’s impossible to test), they were pushing the high-availability envelope right out of the gate.

So no big deal, right?  Foursquare, parts of Netflix, etc. were down due to the outage.  Other than inconvenience and the inability of narcissistic people to let the world know where they are and what they’re doing, it’s not really that big a deal (for us).

And then this came out:

Specifically this line:

“We are a monitoring company and are monitoring hundreds of cardiac patients at home.  We were unable to see their ECG signals since 21st of April.”

Really?  You have a life-critical application and you hosted it “in the cloud”?  Did it never occur to you that it’s probably *NOT* a good place for a life-or-death application?  While I would consider it as a backup, definitely not my one and only.

People who know me know I have a rule.  I don’t say it works until I’ve seen it work at least once, and even then I’ll qualify my statement with “well, I saw it work under THESE conditions.”  I do *NOT* say something works based on what some sales or marketing person tells me works.  (Trust me, this has been a major sticking point between me and my sales team. 😉)

That being said, you have to accept that if you put your critical apps in “the cloud,” by its very nature you are abdicating your control over them and putting your full faith in someone ELSE to fix the problem.  Someone who may not think your application is as important as the one in the rack next to yours.

Are you going to take someone’s word that something is “Highly Available” if you haven’t actually pulled the plug yourself and watched it fail over?  I won’t.  I will candidly couch my answer in “That’s the way it’s supposed to work” or “That’s the way it’s designed to work.”  But until you see a failover, that’s not the way it DOES work, because it never has.

I run my own email, my own webserver, my own infrastructure. I prefer it this way, because now if the system goes down, I know exactly whose butt to kick.

As a rule, if I’m paying someone else to provide a service… I make sure I know where, how, and who to call when it blows up.  It’s probably the best advice I can give.

Amazon billed this as being “highly available” and maybe it is, for the most part.  But obviously if you think of a million ways for something to go wrong, you can bet even money on there being at least a million and one ways for it to fail.

Instead of EC2, they should have named it “Titanic” because everyone knows the easiest way to invite disaster is to tell the world you’re immune to it.


    • ciwei on May 4, 2011 at 3:21 pm

    “I run my own email, my own webserver, my own infrastructure. … ”

    For a startup, running your critical application in the cloud vs. running it in your own data center:

    1. It might be much cheaper for a startup.
    2. It might provide much better availability than rolling your own.

    1. Most of the companies involved were not start-ups… not by a long shot.

      It *STILL* comes down to this: “If I run my own hardware, at least I know whose ass to kick when things go down.”

      I also know that there won’t be a million OTHER people kicking said ass. When you’re sharing an infrastructure with a thousand other people, odds are yours isn’t going to be the first one to be brought back up, by at least a thousand to one.
