When the links go bouncing…

In a true DR environment where synchronous replication is used, it's best to have two routes from source to target, or at the very least a switched route that can dynamically re-route in semi-real time.

Everyone knows the story. The link is up, everything is good: the source acks a write to the host only when the target acks it. The link is down, replication is halted: the source acks to the host as soon as the write is committed to cache on the source.

(Or, in this case, you have two optical routes but somehow managed to put them both through the same DWDM tray, which then failed, taking out both routes.)
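To make those two steady states concrete, here's a minimal sketch of the decision the source array makes on every host write. Everything in it (the ReplicationLink class, the function names, the resync tracking) is hypothetical pseudocode for the behavior described above, not any vendor's actual firmware:

import time

class ReplicationLink:
    """Stand-in for the link between the source and target arrays."""
    def __init__(self):
        self.up = True

    def send_and_wait_for_ack(self, lba, data):
        # Ship the write to the target and block until the target acks it.
        if not self.up:
            raise ConnectionError("replication link is down")
        time.sleep(0.001)           # stand-in for the round trip to the target
        return True

def handle_host_write(link, lba, data, source_cache, resync_tracker):
    source_cache[lba] = data        # the write always lands in source cache first
    if link.up:
        # Link up: only ack the host once the TARGET has acknowledged.
        link.send_and_wait_for_ack(lba, data)
        return "ack to host (committed on both arrays)"
    # Link down: replication is suspended; ack the host as soon as the write
    # is in source cache, and remember the block for a later resync.
    resync_tracker.add(lba)
    return "ack to host (source only, flagged for resync)"

The ugly case, of course, is the one the sketch doesn't model: a link that keeps flipping between those two branches.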

But I've seen it happen more often than not: the "bouncing" link. Up, down, up, down, up, down, etc., etc.

Very few storage systems handle that well, mostly because when the link is only halfway there, the system gets torn between the synchronous-replication requirement to wait for the target's acknowledgement and the decision to declare the link dead and acknowledge locally.

The good news is that most host operating systems handle it wonderfully. Sun records such events as "retryable disk errors"; I don't think Windows and AIX even report it.

Enter Red Hat Linux, or in this case, RHEV. RHEV uses a standard LVM2 volume group, with the virtual disks as logical volumes within that volume group. Simple enough, right?
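For anyone who hasn't looked at one of these hosts: the layout is visible with the ordinary LVM reporting tools. Here's a rough sketch of dumping it with Python; the volume group name is made up, but the pvs/lvs report fields are standard LVM2 ones.

import subprocess

def lvm_report(tool, fields):
    """Run an LVM reporting tool (pvs or lvs) and return its rows as lists."""
    out = subprocess.run(
        [tool, "--noheadings", "--separator", ",", "-o", fields],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.strip().split(",") for line in out.splitlines() if line.strip()]

vg = "rhev_data_domain"   # hypothetical name of the VG backing the storage domain

pvs = [row for row in lvm_report("pvs", "pv_name,vg_name") if row[1] == vg]
lvs = [row for row in lvm_report("lvs", "lv_name,vg_name,lv_size") if row[1] == vg]

print(f"{vg}: {len(pvs)} PVs (LUNs), {len(lvs)} LVs (virtual disks)")
for pv_name, _ in pvs:
    print("  PV:", pv_name)
for lv_name, _, lv_size in lvs:
    print("  LV:", lv_name, lv_size)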

Well, what if you have disks from different disk subsystems? What if you have some mirrored and some not? (The usual reason for that would be test/dev and production in the same environment, though putting dev/test and production in the same cluster is kinda nutty.)

The situation I just saw was this: 4x 500GB volumes, only ONE of them mirrored. RHEV apparently put them all in the same volume group.

You *NEVER* put mirrored and non-mirrored volumes in the same volume group, if for no other reason than that the disk on the target array is USELESS without its partner disks.
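Nothing in the tooling enforces that rule for you, but the check is trivial if you keep even a crude inventory of which LUNs are replicated. A sketch with a completely made-up inventory (in real life it would come from your storage team's records or your multipath/PowerPath output):

# Made-up example inventory: which array each LUN lives on, and whether it is
# synchronously mirrored to the DR site.
pv_inventory = {
    "/dev/mapper/lun1": {"array": "prod_array", "mirrored": True},
    "/dev/mapper/lun2": {"array": "prod_array", "mirrored": False},
    "/dev/mapper/lun3": {"array": "prod_array", "mirrored": False},
    "/dev/mapper/lun4": {"array": "prod_array", "mirrored": False},
}

# Which PVs ended up in which volume group (the scenario from this post).
vg_membership = {
    "rhev_data_domain": ["/dev/mapper/lun1", "/dev/mapper/lun2",
                         "/dev/mapper/lun3", "/dev/mapper/lun4"],
}

for vg, pvs in vg_membership.items():
    mirrored = [pv for pv in pvs if pv_inventory[pv]["mirrored"]]
    if mirrored and len(mirrored) != len(pvs):
        print(f"WARNING: {vg} mixes mirrored and non-mirrored LUNs; "
              f"the DR copy of {mirrored} is useless without the other "
              f"{len(pvs) - len(mirrored)} disks")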

In this case we had one disk out of four that was dropping on and off-line. Some admin gets the idea to reboot the host, which of course attempts to close the volume group. When it can't flush those writes to disk, the behavior gets a little unpredictable: most likely the shutdown will hang, prompting some overzealous admin to go hit the power switch…

Data loss ensues, because there are cached writes that haven't been committed.

And they call me for help with it. Meanwhile, the freeware VMware ESXi environment, which is also replicated, and which *I* have been pushing hard for enterprise-wide adoption of, blows right through the 36 hours of random problems without even a sigh.

The problem with calling me for help is that I can just SMELL someone trying to blame the data loss on EMC, and I want NOTHING to do with it. So I tell them to open a support ticket with Red Hat.

Oops, they didn't buy support. Apparently, once you throw in support, the cost-benefit analysis vs. VMware makes it too expensive…

FML

I worked for 18 straight hours on Friday.

11 comments

    • dim on July 7, 2010 at 5:56 am

    You say this as if Red Hat is to blame; the person responsible for this misconfig is whoever decided to push 4 different LUNs into a single VG.

    1. Touché – as well as the guy who decided to simply hit the power button rather than actually troubleshoot the problem.

      The way I look at it, and from what I read, RHEV doesn't really give you the option of what to put where, does it? Can someone answer that question? I mean, with VMware you simply don't stripe VMFS across LUNs; you create the LUNs to be the size you need and that's that. So if/when there is a crash, there is no risk of any consistency issues.

        • dim on July 7, 2010 at 2:58 pm

        Well, there are different ways of working with the storage domains; you have to understand what you are doing, of course.
        Since a storage domain (SD) is, at the storage level, a VG, you can expand it with additional LUNs (which are in effect PVs), thus making it larger (the lvm-level equivalent is sketched at the end of this comment). That is where the mistake took place.
        On the other hand, you can always add new SDs if the LUNs (PVs) are based on different storage types, and create your VMs with virtual disks in the different SDs according to each VM's requirements, e.g. put a NIS server VM on an SD that is based on a slower LUN, and put the database VMs on the faster LUNs.

        As long as all the hosts in the same cluster can access all the relevant LUNs (and they should), this will work.

        I don't think you can actually "sense", on the host side, whether the LUNs you are presented with are different (might be a nice feature for dm-multipath to have, I guess; does PowerPath do that?), so you have to know your storage when creating a setup, of course.

        note to self – this is good KB material 🙂
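        A rough sketch of what that expansion boils down to at the lvm level; the device and VG names below are placeholders, and RHEV drives the equivalent steps through its own management layer rather than you running them by hand:

import subprocess

new_lun = "/dev/mapper/new_500g_lun"     # hypothetical freshly presented LUN
storage_domain_vg = "rhev_data_domain"   # hypothetical VG backing the storage domain

subprocess.run(["pvcreate", new_lun], check=True)                     # label the LUN as a PV
subprocess.run(["vgextend", storage_domain_vg, new_lun], check=True)  # grow the VG, i.e. the domain
subprocess.run(["vgs", storage_domain_vg], check=True)                # confirm the new size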

          1. But in RHEV, when you create a storage domain you're not able to "VMotion" a system between domains, right? Doesn't that pose serious limitations? I mean, with VMware it's easy to move a VM from one storage type to another without downtime, which I've found to be invaluable in many instances… (I've run test boxes on NFS storage and then migrated them to FC storage when the test system goes into production.)

          Also – since the VGs are essentially LVM2, how do they handle locking when multiple hypervisors are hitting VMs? VMFS handles this very well up to a point (I think 16-20 machines is the recommended limit per LUN), but I've never known LVM2 to be particularly tolerant of block-level sharing…

          So the mistake here was when the admin threw all storage into the same domain and didn’t bother to check on mirroring. (I have not been allowed to have any kind of real view into the system, so I don’t know, I just handed them the storage they requested.)

            • dim on July 7, 2010 at 5:11 pm

            >But in RHEV when you create a storage domain you’re not
            >able to “VMotion” a system between domains right?
            You probably mean "Storage VMotion" – that's currently an offline-only feature (so you can only move VMs that aren't running), but qemu has it upstream, so RHEV will have it soon enough.

            >Also – since the VG’s are essentially LVM2, how do
            >they handle locking when multiple hypervisors are
            >hitting VM’s?
            That's what the SPM (Storage Pool Manager) is for, and why it's so important.

            Not having an FS makes RHEV storage handling much faster in the long run, and also removes the host-number limitation (I've seen a cluster of ~70 hosts, and that's only because I ran out of machines to add).

            >So the mistake here was when the admin threw all
            >storage into the same domain and didn’t bother to
            >check on mirroring.
            Yup, not a "best practice" in anyone's book.

            • brerrabbit on July 13, 2010 at 4:12 pm

            Just so you know, the effective limit on the number of hosts in an ESX cluster for 4.0 (haven't checked 4.1 since it just came out this morning) is 256 minus the number of LUNs you intend to present to the hosts (e.g., present 200 LUNs and, by that math, you're capped at roughly 56 hosts).

            In other words, in 4.0 there's a hard limit of 256 LUNs per cluster, and each host brings a LUN into the count: ESX has a local VMFS LUN where the service console runs, and I suspect there's a hidden VMFS partition on ESXi that, while probably tiny, still counts toward the limit.

  1. Unfortunately, technology can hardly ever fully cover for a sysadmin mistake. I don't know of any volume manager that will prevent an admin from adding different types of LUNs to a volume group (VxVM, ZFS, LVM, etc.).

    If you want to play dangerous, you can also have a VMFS filesystem span multiple LUNs. It's generally against best practices (but it's there if you really want it), and the LUNs don't *have* to be the same protection type. You do have to intentionally want to do it and know how to do it… so it's much, much less likely to happen (but possible).

    As to the claim of avoiding the filesystem for performance reasons… I'm not completely sure about that claim. We too have had a very large single cluster of ESX servers (52 nodes) running in production… and yes, it was very much out of spec (not my idea, but it ran for a year-plus that way). There are certain times when locks happen that have an effect on performance within VMFS: creating a vmdk, metadata changes (e.g. growing a vmdk or a snapshot), powering a guest or host on or off. Similar lvm2 cluster-type locks also occur at certain times. LVM2 snapshots growing don't have the same lock effects, but other operations do.

    I haven't run a RHEV system, but having tried to run a 90TB GFS2 filesystem under RHEL 5… I have to say I had nothing but problems with clvm (patch on top of patch on top of patch and still nothing but hanging problems), some rather painful ones that required a full cluster outage. VMFS I've found just works… Additionally, if you are that twitchy about the filesystem being an issue, you can do RDM and avoid filesystems completely (though that is damn icky for management) and use raw LUNs for guests.

      • dim on July 10, 2010 at 1:29 am

      What does clvm have to do with RHEV?

      1. My understanding of RHEV with block storage is that it generally uses LVM volumes to carve up and present the storage. I could be wrong (there is very limited documentation on the tech behind the RHEV management layer… I'm assuming it's similar tech to what you'd use without their management), but I believe RHEV uses RHCS to manage clustering of Xen, and now KVM with RHEL 6; at least that's been the pitch from our RHEL sales reps over the years. Since RHCS has pretty much always used clvm (from the Sistina purchase years ago) for clustered logical volumes, and I've seen nothing contradicting it, I'm assuming they're continuing to use it with RHEV clusters (there are a number of guides on building Xen Linux clusters with clvm and/or GFS).

          • dim on July 11, 2010 at 1:26 am

          Well, you're in for a surprise – no clvm in there, plain lvm2 instead, with an extra daemon to handle locking and storage management.

          If you are interested in actually seeing the product, you might want to contact your sales rep; otherwise it's just guesses that don't lead anywhere.

          1. Not to be argumentative… but I think we are talking about the same thing. clvm is exactly that: lvm2 with an extra daemon (clvmd) to handle locking and storage. It is basically part of lvm; just open /etc/lvm/lvm.conf on any Red Hat box (from Advanced Platform to Fedora) and you'll find references to it. Red Hat does separate the two and only includes the clvm daemon in certain channels, so if you want their binary you have to pay for it, but there's nothing really special or unique about it.
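          If I remember the lvm.conf comments right, the difference is a single knob in that file: locking_type 1 is the ordinary local file-based locking, and 3 is the built-in clustered locking that clvmd provides. A deliberately naive sketch that just reads the setting back out:

import re

def lvm_locking_type(path="/etc/lvm/lvm.conf"):
    """Pull the locking_type value out of lvm.conf (naive line-by-line parse)."""
    with open(path) as conf:
        for line in conf:
            line = line.split("#", 1)[0]                      # strip comments
            match = re.match(r"\s*locking_type\s*=\s*(\d+)", line)
            if match:
                return int(match.group(1))
    return None

locking = lvm_locking_type()
if locking == 3:
    print("built-in clustered locking (clvmd-style)")
else:
    print("locking_type =", locking)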
