An engineer friend of mine (a real engineer, not affiliated with computers) once told me:
“There are three options:
    1. You can have it faster.
    2. You can have it smaller.
    3. You can have it cheaper.
…Now pick any two.”
Over and over in my life I’ve put that theory to the test, and to this day it has always held true. The smaller and faster something is, the more expensive it gets. The cheaper something is, the slower and less portable it is.
Disaster Recovery and COOP (Continuity of Operations, for the layman) follow a lot of the same rules.
There are three main criteria you’re aiming for. The main two are RPO and RTO: “Recovery Point Objective” and “Recovery Time Objective.”
The third is, of course, cost.
RPO is the point in time to which you need to be able to recover. Some goals are easy to meet: “Midnight on the morning of the failure” is usually pretty easily obtainable, as you can do that by restoring from backups. Financial institutions aim for somewhat stricter objectives. Most banks will require an RPO of “Zero,” meaning “I want to see the last committed transaction on my DR site in the event the source site becomes a smoking hole in the ground.”
This is doable of course, provided the DR site is close enough to the source site to run dark fibre between the two with low enough latency to add negligible impact to production. (The rule of thumb for synchronous replication is 2ms per 10k; that is, for every 10 kilometers you’re adding 2ms of latency. A normal physical drive has a latency of about 9-14ms, so if you go too far you’re going to slow your system to a crawl.)
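To put numbers on that rule of thumb, here is a minimal Python sketch. It assumes the 2ms-per-10km figure quoted above and simply adds it to a drive latency you supply; the function names are my own, and real latency depends on your link and arrays.

```python
# Rule of thumb from the text: synchronous replication adds ~2 ms per 10 km.
MS_PER_10_KM = 2.0


def added_latency_ms(distance_km: float) -> float:
    """Extra write latency (ms) from synchronous replication at this distance."""
    return distance_km / 10.0 * MS_PER_10_KM


def write_latency_ms(distance_km: float, drive_latency_ms: float) -> float:
    """Drive latency plus the replication penalty for one write."""
    return drive_latency_ms + added_latency_ms(distance_km)


if __name__ == "__main__":
    # 9 ms is the low end of the 9-14 ms drive latency quoted above.
    for km in (10, 20, 50, 100):
        extra = added_latency_ms(km)
        total = write_latency_ms(km, 9.0)
        print(f"{km:>4} km: +{extra:.1f} ms, total {total:.1f} ms per write")
```

By this estimate, at 50 kilometers the replication penalty alone (10ms) is already larger than the 9ms drive latency, which is the "slow to a crawl" territory.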
RTO is defined as “how long can I afford to have my environment down to effect a failover.” I’ve worked in one environment where transaction logs were backed up to tape and shipped across country from the L.A. area to Orlando, Florida, where the tapes were then restored into a standby system. The recovery time to a 15-minute increment was effectively days, because they actually had to wait for the last tape to make it to the target site before they could restore it and bring the system on-line. It was insane.
Your goal is to get RPO and RTO to as close to zero as possible without bankrupting the budget (or the company).
An RPO of zero can be obtained with a DR site within about 10 kilometers, 20 if you can live with the slower response times in production. This is full synchronous transfer from one array to another: every write from host to disk has to be acknowledged by the REMOTE array before it is reported to the host that the write is committed.
EMC’s SRDF/A and SRDF/AR mitigate that in environments where the DR site is far enough away as to kill any chance of SRDF/Synchronous working.
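The synchronous write path can be pictured with a toy model. This is only a sketch of the acknowledgement rule described above, not anything resembling a real array's API; the `Array` class and `synchronous_write` function are names I made up for illustration.

```python
class Array:
    """Toy stand-in for a storage array (hypothetical, not a real EMC API)."""

    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.blocks = {}

    def write(self, lba, data):
        """Store the block and return True as the acknowledgement."""
        if not self.reachable:
            return False
        self.blocks[lba] = data
        return True


def synchronous_write(local, remote, lba, data):
    """Report 'committed' to the host only after BOTH arrays acknowledge."""
    if not local.write(lba, data):
        return False
    # The host sits and waits here for the round trip to the DR site.
    return remote.write(lba, data)


if __name__ == "__main__":
    prod, dr = Array("prod"), Array("dr")
    print(synchronous_write(prod, dr, 7, b"txn"))  # both arrays acknowledged
    print(dr.blocks == prod.blocks)                # DR copy matches production
```

Because the host never hears "committed" until the remote array has the block, the DR site always holds the last committed write. That is the RPO of zero, and the wait in the middle is where the distance penalty comes from.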
SRDF/A is a “packetized” SRDF, where the receiving Symm has to receive two consecutive “checkpoints” before it commits the block of data. That way if an incomplete block is received, it’s discarded to prevent data corruption resulting from incomplete write information. The downside to SRDF/A, of course, is that it requires an insane amount of cache to function properly. (And don’t let an SE tell you it doesn’t. He’s either lying or not capable of understanding that for the remote Symm to receive a block of data, it has to be able to store it somewhere other than disk until it receives two checkpoints.)
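Here is a toy model of that two-checkpoint behaviour, just to show why the cache matters. It is my own simplification of the description above, not how EMC actually implements SRDF/A; the class and method names are invented for the example.

```python
class AsyncReceiver:
    """Toy model of the receiving side: cache writes until a cycle completes."""

    def __init__(self):
        self.disk = {}         # committed data
        self.cache = {}        # writes held between checkpoints
        self.in_cycle = False  # True once an opening checkpoint has arrived

    def checkpoint(self):
        """A checkpoint closes the current cycle (if any) and opens the next."""
        if self.in_cycle:
            # Second consecutive checkpoint: the cycle is complete, commit it.
            self.disk.update(self.cache)
            self.cache = {}
        self.in_cycle = True

    def receive(self, lba, data):
        """Hold incoming writes in cache; nothing touches disk yet."""
        if self.in_cycle:
            self.cache[lba] = data

    def link_lost(self):
        """An incomplete cycle is thrown away rather than half-applied."""
        self.cache = {}
        self.in_cycle = False


if __name__ == "__main__":
    r = AsyncReceiver()
    r.checkpoint()
    r.receive(1, "a")
    r.receive(2, "b")
    print(r.disk)   # still empty: only one checkpoint so far
    r.checkpoint()
    print(r.disk)   # now holds both blocks
```

Everything that arrives between two checkpoints lives in `cache` until the second one shows up, so the busier the link and the longer the cycle, the more cache the remote side needs.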
SRDF/AR is an automated replication product. You are essentially mirroring production to a TimeFinder BCV, which is then sent synchronously to the remote site. You can run a Sync transfer because the BCVs are not connected to the production volumes, and as such the production volumes do not require any ACK/NAK from the remote system. Depending on the time it takes to replicate (how fast the pipe is between the two sites) you can get RTO to about 10 minutes, which is good enough for most. The effects of SRDF/AR can be duplicated by anyone proficient in Korn shell, as it literally runs a series of waits and whiles for each stage of the process. AR has the added bonus that you can actually keep a second set of BCVs on the target host and run your backups from them. The downside to the AR type of scenario (whether it be SRDF/AR or a scripted set-up) is that it costs disks, and lots of them. There are the production volumes, mirrored; the first set of BCVs, unprotected; the SRDF target devices (mirrored or RAID 5); and the second set of BCVs.
As I prepare to start my own replication design this was foremost on my mind, which is how it ended up here. (This is, after all, the dumping ground for my random thoughts.)
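The "waits and whiles" structure looks roughly like the sketch below. I've written it in Python rather than Korn shell, and the stage names and the `start`/`is_done` callables are placeholders I made up; in a real script each one would wrap whatever command your array tooling provides.

```python
import time


def wait_until(is_done, poll_seconds=5.0, sleep=time.sleep):
    """The 'while not finished, wait' loop used between stages."""
    while not is_done():
        sleep(poll_seconds)


def run_cycle(steps, poll_seconds=5.0, sleep=time.sleep):
    """Run one replication cycle.

    steps is a list of (name, start, is_done) tuples, e.g. establish the
    BCV, split it, push it to the remote site, refresh the remote BCV.
    Each stage is kicked off, then polled until it reports completion.
    """
    completed = []
    for name, start, is_done in steps:
        start()
        wait_until(is_done, poll_seconds, sleep)
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Placeholder stages that finish immediately, just to show the shape.
    def nothing():
        pass

    def done():
        return True

    stages = [
        ("establish BCV", nothing, done),
        ("split BCV", nothing, done),
        ("send BCV to remote site", nothing, done),
        ("refresh remote BCV", nothing, done),
    ]
    print(run_cycle(stages, poll_seconds=0))
```

Wrap `run_cycle` in an outer loop and the time one pass takes is effectively how far behind the remote copy can be.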