In a true DR environment where synchronous replication is used, it’s best to have two routes from source to target, or at the very least a switched route that can dynamically re-route in near-real-time.
Everyone knows the story. The link is up, everything is good: the source acks a write to the host only when the target acks it. The link is down, replication is halted: the source acks to the host as soon as the write is committed to cache on the source.
(Or, in this case, you have two optical routes but somehow managed to put it all through the same DWDM tray, which then failed, taking out both routes)
But I’ve seen it happen more often than not. The “bouncing” link. Up, down, up, down, up, down, etc., etc.
Very few storage systems handle that well. Mostly because when the link is halfway there, the system gets torn between the synchronous-replication requirement to wait for the target’s acknowledgement and the need to keep acknowledging writes to the host.
The good news is most host operating systems handle it wonderfully. Sun records such events as “Retryable disk errors”; I don’t think Windows and AIX even report it.
Enter RedHat Linux, or in this case, RHEV. RHEV uses a standard lvm2 volume group with virtual disks as logical volumes within the volume group. Simple enough, right?
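Roughly, that layout can be sketched with plain lvm2 commands like this (a minimal illustration only — the device paths and names are hypothetical, not RHEV’s actual naming):

```shell
# Illustrative sketch of the storage layout described above.
# Each SAN LUN becomes an LVM physical volume, the LUNs are pooled
# into one volume group, and each virtual disk is carved out as a
# logical volume. Device paths and names here are hypothetical.
pvcreate /dev/mapper/lun0 /dev/mapper/lun1
vgcreate storage_domain /dev/mapper/lun0 /dev/mapper/lun1
lvcreate -L 40G -n vm_disk_1 storage_domain
lvcreate -L 40G -n vm_disk_2 storage_domain
```

Note that once both LUNs are in one volume group, a logical volume is free to span either of them — which is exactly what makes the mixing described below so dangerous.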
Well, what if you have disks from different disk subsystems? What if you have some mirrored and some not? (The usual reason for that would be test/dev and production in the same environment — though putting dev/test and production in the same cluster is kinda nutty.)
The situation I just saw was this. 4x 500G volumes, only ONE of them mirrored. RHEV apparently put them all in the same volume group.
You *NEVER* put mirrored and non-mirrored volumes in the same volume group. If for no other reason than the disk on the target array is USELESS without its partner disks.
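The separation being argued for here can be sketched like so (again, LUN paths and volume-group names are hypothetical):

```shell
# Keep replicated and non-replicated LUNs in separate volume groups,
# so the target array's copy is self-consistent on its own.
# Device paths and VG names are hypothetical.
vgcreate vg_prod_mirrored  /dev/mapper/mirrored_lun0
vgcreate vg_dev_unmirrored /dev/mapper/plain_lun0 /dev/mapper/plain_lun1
# If mirrored and unmirrored PVs share one VG, an LV can span both,
# and the replicated half is unusable without the unreplicated half.
```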
In this case we had one disk out of four that was dropping on- and off-line. Some admin gets the idea to reboot the host, which of course attempts to close the volume group. When it can’t flush those writes to disk, the behavior gets a little unpredictable. Most likely the shutdown will hang, causing some overzealous admin to go hit the power switch…
Data loss ensues, because there are cached writes that haven’t been committed.
And they call me for help with it. Meanwhile, the freeware VMware ESXi environment, which is also replicated, and which *I* have been pushing hard for enterprise-wide adoption of, blows right through the 36 hours of random problems with not even a sigh.
The problem with calling me for help is that I can just SMELL someone trying to blame the data loss on EMC, and I want NOTHING to do with it. So I tell them to open a support ticket with RedHat.
Oops, they didn’t buy support. Apparently when you throw in support, the cost-benefit analysis vs. VMware makes it too expensive…
I worked for 18 straight hours on Friday.