RPO vs. RTO

An engineer friend of mine (a real engineer, not affiliated with computers) once told me:

 “There are three options:

     1. You can have it faster.
     2. You can have it smaller.
     3. You can have it cheaper.

…Now pick any two.”

Over and over in my life I’ve put that theory to the test, and to this day it has always held true.  The smaller and faster something is, the more expensive it gets.  The cheaper something is, the slower and less portable it gets.

Disaster Recovery and COOP (Continuity of Operations, for the layman) follow a lot of the same rules.

There are three main criteria you’re aiming for.  The main two are RPO and RTO: “Recovery Point Objective” and “Recovery Time Objective.”

The third is, of course, cost.

RPO is defined by the point in time you need to be able to recover to.  Some goals are easy to meet: “midnight on the morning of the failure” is usually attainable, since you can get there by restoring from backups.  Financial institutions aim for stricter objectives.  Most banks require an RPO of zero, meaning “I want to see the last committed transaction on my DR site in the event the source site becomes a smoking hole in the ground.”

This is doable, of course, provided the DR site is close enough to the source site to run dark fibre between the two with low enough latency to add negligible impact to production.  (The rule of thumb for synchronous replication is 2ms per 10km: for every 10 kilometers of distance you add about 2ms of latency.  A normal physical drive has a latency of about 9–14ms, so if you go too far you’re going to slow your system to a crawl.)
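To get a feel for what that rule of thumb means in practice, here’s a quick Python sketch that applies the 2ms-per-10km figure above.  Both that figure and the 9ms drive latency are the post’s rule-of-thumb numbers, not measured values:

```python
# Rough synchronous-replication latency estimate, using the rule of thumb
# above: ~2 ms of added round-trip latency per 10 km of fibre. The
# 2 ms/10 km figure and the 9 ms drive latency are rule-of-thumb numbers,
# not measurements.

def added_latency_ms(distance_km: float, ms_per_10km: float = 2.0) -> float:
    """Extra write latency introduced by synchronous replication."""
    return distance_km / 10.0 * ms_per_10km

def effective_write_ms(drive_ms: float, distance_km: float) -> float:
    """Local drive latency plus the replication penalty."""
    return drive_ms + added_latency_ms(distance_km)

for km in (10, 20, 50, 100):
    print(f"{km:>4} km: +{added_latency_ms(km):.0f} ms replication "
          f"-> ~{effective_write_ms(9.0, km):.0f} ms per write")
```

At 100km the replication penalty alone is double the drive’s own latency, which is why the distance limit bites so hard.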

RTO is defined as “how long can I afford to have my environment down while I effect a failover?”  I’ve worked in one environment where transaction logs were backed up to tape and shipped across the country from the L.A. area to Orlando, Florida, where the tapes were then restored into a standby system.  The recovery time to a 15-minute increment was effectively days, because they had to wait for the last tape to reach the target site before they could restore it and bring the system on-line.  It was insane.

Your goal is to get RPO and RTO to as close to zero as possible without bankrupting the budget (or the company).

An RPO of zero can be obtained with a DR site within about 10 kilometers, or 20 if you can live with slower response times in production.  This is full synchronous transfer from one array to another: every write from host to disk has to be acknowledged by the REMOTE array before the host is told the write is committed.
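To make that write path concrete, here’s a minimal sketch of synchronous mirroring.  The class and method names are purely illustrative, not any vendor’s API:

```python
# Minimal sketch of the synchronous write path described above: the host
# is not told "committed" until the REMOTE array has acknowledged the
# write. Class and method names are illustrative, not any vendor's API.

class RemoteArray:
    def __init__(self):
        self.blocks = {}

    def replicate(self, lba, data) -> bool:
        self.blocks[lba] = data
        return True              # ACK: the block is safely stored remotely

class LocalArray:
    def __init__(self, remote: RemoteArray):
        self.remote = remote
        self.blocks = {}

    def write(self, lba, data) -> bool:
        self.blocks[lba] = data
        # Synchronous mode: wait for the remote ACK before acknowledging
        # the host. This round trip is where the distance penalty lands.
        if not self.remote.replicate(lba, data):
            raise IOError("remote NAK - write not committed")
        return True

remote = RemoteArray()
local = LocalArray(remote)
local.write(42, b"last committed transaction")
```

Because every acknowledged write already exists on the remote array, losing the source site loses nothing: that’s what an RPO of zero buys you.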

EMC’s SRDF/A and SRDF/AR mitigate that in environments where the DR site is far enough away to kill any chance of SRDF/Synchronous working.

SRDF/A is a “packetized” SRDF, where the receiving Symm has to receive two consecutive “checkpoints” before it commits the block of data.  That way, if an incomplete block is received, it’s discarded to prevent data corruption from incomplete write information.  The downside to SRDF/A, of course, is that it requires an insane amount of cache to function properly.  (And don’t let an SE tell you it doesn’t; he’s either lying or doesn’t understand that for the remote Symm to receive a block of data, it has to store it somewhere other than disk until it receives two checkpoints.)
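The two-checkpoint idea can be sketched like so.  This models the concept only; it is in no way EMC’s actual SRDF/A implementation, and the cycle/checkpoint structure here is a simplification:

```python
# Illustrative sketch of the two-checkpoint rule described above: the
# receiving array holds each delta set in cache and commits it to disk
# only after two consecutive checkpoints bracket it, so a half-received
# set gets discarded instead of corrupting the target. The cache dict
# below is also why the remote side needs so much cache.

class ReceivingArray:
    def __init__(self):
        self.cache = {}        # cycle number -> delta set held in cache
        self.committed = []    # delta sets safely written to disk
        self.checkpoints = []  # cycle numbers whose checkpoint arrived

    def receive_delta(self, cycle, delta):
        self.cache[cycle] = delta          # held in cache, NOT on disk yet

    def receive_checkpoint(self, cycle):
        self.checkpoints.append(cycle)
        # Commit cycle N only once checkpoints N and N+1 have both arrived.
        if len(self.checkpoints) >= 2:
            n = self.checkpoints[-2]
            if self.checkpoints[-1] == n + 1 and n in self.cache:
                self.committed.append(self.cache.pop(n))

rx = ReceivingArray()
rx.receive_delta(1, "writes 0-999")
rx.receive_checkpoint(1)
rx.receive_delta(2, "writes 1000-1999")
rx.receive_checkpoint(2)   # second checkpoint lets cycle 1 commit
```

After this runs, cycle 1 is on disk but cycle 2 is still sitting in cache waiting for checkpoint 3, which is exactly where the cache requirement comes from.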

SRDF/AR is an automated replication product.  You are essentially mirroring production to a TimeFinder BCV, which is then sent synchronously to the remote site.  You can run a synchronous transfer because the BCVs are not connected to the production volumes, so the production volumes don’t require any ACK/NAK from the remote system.  Depending on the time it takes to replicate (how fast the pipe is between the two sites) you can get RTO to about 10 minutes, which is good enough for most.  The effects of SRDF/AR can be duplicated by anyone proficient in Korn shell, as it literally runs a series of waits and whiles for each stage of the process.  AR has the added bonus that you can keep a second set of BCVs on the target host and run your backups from them.  The downside to the AR type of scenario (whether it’s SRDF/AR or a scripted set-up) is that it costs disks, and lots of them: the production volumes (mirrored), the first set of BCVs (unprotected), the SRDF target devices (mirrored or RAID 5), and the second set of BCVs.
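The staged loop that SRDF/AR (or an equivalent hand-rolled Korn shell script) runs looks roughly like this, sketched in Python.  The stage names are placeholders for whatever array CLI commands the real script would invoke, and the transfer check is stubbed out:

```python
# Hedged sketch of an SRDF/AR-style cycle: split the BCV from production
# to freeze a point-in-time image, push it to the remote site, wait for
# the transfer, then re-establish the BCV for the next pass. Stage names
# are placeholders, not real CLI commands.
import time

def transfer_complete() -> bool:
    return True                      # stub: poll the real transfer status here

def replication_cycle(poll_seconds: float = 0.05) -> list:
    stages = []
    stages.append("split")           # freeze a consistent BCV image
    stages.append("srdf_push")       # ship the frozen BCV to the remote site
    while not transfer_complete():   # the "waits and whiles" of the script
        time.sleep(poll_seconds)
    stages.append("re-establish")    # catch the BCV back up with production
    return stages

print(replication_cycle())           # one pass through the stages
```

How often this loop completes is what bounds how stale the remote copy can be, which is why the pipe between the sites matters so much.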

Scary huh?

As I prepare to start my own replication design, this was foremost on my mind, which is how it ended up here.  (This is, after all, the dumping ground for my random thoughts.)

4 comments


    on November 1, 2006 at 7:39 am

    Jesse,
    I see you were at MTI in 2000. I was there from 2000-2004. Drop me a line.
    jrmckins@yahoo.com
    Jim

  1. Yeah, MTI was a sucking waste of a year of my life. Only to get canned when I objected to a less-than-perfect design sold to a customer in Vegas. When it finally did blow up in their faces, they didn’t like the “I told ya so,” so I was out in the next round of layoffs.

    Don’t mind it too much though, I got hired by an EMC Partner a couple of weeks later, and that’s when the *REAL* work started.

    on November 7, 2006 at 2:44 pm

    Haha

    What *real* work is that? Irritating problems, long hours, endless support calls, doesn’t do what it says on the package type stuff 😉

    Just kidding of course – I couldn’t resist the joke though.

    I like your view on the two of three options.

    Mackem

  2. Yeah – most of my work was “Hey sales told me it would even make coffee”

    You have two options in those situations. You break the news to them gently, or you enable the coffee-making subroutine. 😉

    That was where I learned that MTI wasn’t alone. It’s a habit of salespeople to say whatever they have to to make the sale, regardless of whether or not it’s true.

    I got pulled into a movie studio in Burbank that will remain nameless (anyone who knows me knows which one it was). It was supposed to be six weeks of TimeFinder / DB2 scripting. It turned out they were running SAP over DB2, so the backups weren’t working. When I dug in, it turned out that DB2/SAP/TimeFinder had at that point never been supported, because SAP configured DB2 in an odd way that prevented you from freezing all I/O on the database.

    18 months later I was still there and ended up having to quit the consulting firm I was working for because they wouldn’t let me out.

    *THAT* was fun.
