Internal Celerra Migration

I got sucked into this job, and the only benefit of it is that it’s in California, which, when it comes down to it, is not a bad place to be when it’s snowing back home. Bygones.

Anyway, my job, whether I want it or not, is to figure out a way to move about 4 terabytes of Celerra data from one set of disks (6+1 RAID-5, 7.2K SATA) to newer, faster disks (4+1 RAID-5, 15K FC).

And the rub is that I have to do it online.

This is one of those places where I hate the Celerra. I found this great Primus article (emc144545, if you’re interested) that states quite unequivocally that you can only use the back-end Clariion LUN migration if you are migrating to an identical RAID group, which, to me, negates the reasoning for doing it in the first place.

An identical RAID group. If it’s SATA, the target has to be SATA. If it’s 4+1 RAID-5, the target MUST be 4+1 RAID-5.

Near as I can figure, and this is not stated clearly in the article, it has to do with how the Celerra builds its RAID pools. Since you USUALLY build filesystems and set them to expand into a RAID pool, my guess is that changing the make-up of the disks underneath the filesystem screws up the pool database.

Come on guys, this should be an easy fix. (This and the ability to easily shrink a filesystem would be nice.) When a customer makes one of those mistakes, you know, buying the wrong disks from the outset because they’re focused on capacity and forget a little thing called performance, the hardware should offer an easy way to fix it.

In the case of the customer I’m working with now, Clariion LUN migration was out because of the disk-mark issue, and the standard SecureCopy was out because only minimal downtime is allowed.

The long and the short of it is that I’m getting ready to do an internal CDMS migration. Now, anyone who has used CDMS knows it’s not the fastest product in the world. You also know it can be maddening, because one of the things you *STILL* can’t see is what percentage of the migration is complete.

But as far as the technology goes, its usefulness is awe-inspiring.

CDMS is a “Copy-On-Access” file-level clone. Essentially it builds a duplicate inode table pointing to the old files and presents this to the client. Browsing the new directory structure shows you the entire filesystem structure exactly as it is on the old source box. When you attempt to access a file for the first time, it copies that file from the old filesystem to the new filesystem and then passes it to the end client.

Now this is where it’s a pain: it’s a slow process. Depending on the speed of the source system, the network, etc., you can increase your initial access time 20-fold. (Subsequent accesses come from the new disks, so it’s a one-time hit.)
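
If it helps to picture the mechanics, here’s a rough sketch of the copy-on-first-access idea in shell terms. This is purely an illustration of the concept, not the actual DART internals; the mount points and the helper function are made up.

    # Illustration only: what "copy-on-access" amounts to, conceptually.
    # /mnt/new_fs is the migration target, /mnt/old_fs is the source.
    read_file() {
        local relpath="$1"
        if [ ! -e "/mnt/new_fs/${relpath}" ]; then
            # First access: pull the file across from the source before serving it.
            mkdir -p "$(dirname "/mnt/new_fs/${relpath}")"
            cp -p "/mnt/old_fs/${relpath}" "/mnt/new_fs/${relpath}"
        fi
        # Every access after the first is served from the new disks.
        cat "/mnt/new_fs/${relpath}"
    }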

So tomorrow I start moving almost 4TB this way. Running 32 threads internally (this is an internal Celerra migration), it should run fairly fast, depending on how fast the network stack can process it.

To EMC: fix the disk-mark database to allow a Celerra LUN to be migrated without penalty (or at least with easy-to-moderate reconfiguration). You’ll sell more disks, because people won’t feel married to the disks they’ve got, or worry about committing to a disk type if they’re not absolutely sure of its performance numbers.

To all salespeople: don’t sell SATA disks for production-level applications. They don’t work. (See my next post, SATA, SAS, and Fibrechannel.)

10 comments

  1. Man I’m long-winded when I’m tired and annoyed.

    • Sean Cummins on February 3, 2009 at 1:56 pm

    Have you thought about Celerra Replicator? You could create a new filesystem within the target AVM pool, then configure loopback replication from your old filesystem to the new fs. Once you’ve completed the initial full sync and started replication, it’s continuous async, so downtime is limited to the amount of time required to flush out the remaining deltas after you’ve begun your outage window, plus cutover time. For example: start your downtime window by cutting off access to the production filesystem, wait for the deltas to flush to the target, cut over to the target, then resume production activity on the target fs.
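
    (For reference, with Replicator V2 on DART 5.6 the loopback setup would look roughly like the sketch below. The syntax is from memory, and the session, filesystem, and pool names are placeholders, so check the Replicator docs for your release.)

      # Rough sketch only -- Replicator V2 on DART 5.6, syntax from memory.
      nas_fs -name new_fs -create size=500G pool=clar_r5_performance   # new fs in the target AVM pool
      nas_replicate -create mig_rep -source -fs old_fs \
          -destination -fs new_fs -interconnect loopback \
          -max_time_out_of_sync 10
      nas_replicate -info mig_rep      # watch the initial full copy complete
      # At cutover time: cut off client access, flush the remaining deltas,
      # then switch over and re-export the new filesystem.
      nas_replicate -refresh mig_rep
      nas_replicate -switchover mig_rep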

  2. That is a most excellent idea, and I hate you for having it this week instead of last week. 🙂 (Truthfully, the fact that I didn’t post this until about 24 hours ago makes it squarely my fault. More than anything I am annoyed with myself for not thinking of it, because the minute I read your comment I started banging my head against the wall at how obvious a solution it was.)

    However, CDMS was the way we had to go, because the customer storage admin and I had already spent quite a bit of time selling it to his management.

    I was worried at first; nothing seemed to be moving as quickly as I would have liked. A little troubleshooting eliminated a few bizarre issues, after which the migration took off and totally screamed.

    Issue #1 – when you’re mapping NFS, make sure you do it via the command line so you can add the option “proto=TCP”. The Celerra defaults to UDP, which, as we all know, well… underperforms.
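
    (For what it’s worth, the connect looked something like the sketch below. This is typed from memory, so treat the exact server_cdms flags as an assumption and double-check them against the CDMS guide for your DART release; the server, filesystem, and path names are placeholders.)

      # Sketch from memory -- verify the server_cdms syntax for your release.
      server_cdms server_2 -connect mig_fs -type nfsv3 \
          -path /old_data -source src_server:/export \
          -option proto=TCP,useRootCred=true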

    Issue #2 – Random Weirdness with the Data Mover. I started getting random I/O errors when issuing commands; I’d never seen an I/O error when issuing commands to mount/unmount filesystems. Since we had all users off all filesystems, I was able to drop a hammer on it and reboot the Data Mover.

    Issue #3 – Misc. typos in the script I used to start everything off. Funny how the further down I got in the script, the more typos I found. Amazing thing, being tired.

    (Can you believe the spell checker didn’t flag the word “Weirdness”? – Strange)

    Anyway, the first 7 of 12 filesystems finished within two hours of starting, averaging 30-40 GB/hr (rough estimate). The last two filesystems were 350 GB and 2.4 TB respectively; I fully expect them to take forever, especially since the larger of the two is made up almost entirely of files under 100 KB.

    Fun stuff.

  3. I’m in the process right now of moving filesystems used only for CIFS access from Fibre Channel to SATA drives. Right now we are running 5.5 code. I’m just using fs_copy, then renaming the underlying filesystems and remounting them to the old mount point so the CIFS shares stay intact.
    What was your client using the SATA drives for that was too slow?
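
    (For anyone following along, that fs_copy-and-rename procedure is roughly the sketch below. The filesystem and mount point names are placeholders, and the exact fs_copy options depend on the DART release, so verify the syntax before relying on it.)

      # Sketch of the fs_copy-and-rename approach (names are placeholders).
      fs_copy -start old_fs new_fs              # one-shot baseline copy to the new fs
      server_umount server_2 -perm /old_mount   # drop the old fs from the mount point
      nas_fs -rename old_fs old_fs_retired
      nas_fs -rename new_fs old_fs
      server_mount server_2 old_fs /old_mount   # CIFS shares on /old_mount stay intact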

  4. fs_copy will duplicate a filesystem, but it doesn’t do the rsync-style twinning required to do a largely online move. This move involved about 5TB of data, and we had an outage window of 2 hours to do it.

    Without divulging customer information, their application scrapes through the web looking for specific information. They also do data-collection for legal clients and such.

    The application was slow for two reasons.

    1. When the Celerra was initially installed by EMC PS, the Read Cache was never enabled.

    2. The LUNs within a RAID group were owned by both SPs. That’s a big no-no in the SATA world. (A quick way to check is sketched below.)
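
    (A minimal check, assuming Navisphere CLI is available. The SP address and LUN numbers are placeholders, and the getlun switches are from memory, so verify them against the naviseccli help.)

      # Dump ownership info for each LUN in the RAID group -- the current owner
      # and default owner should line up on a single SP for SATA groups.
      for lun in 20 21 22 23; do
          naviseccli -h spa_address getlun $lun -name -owner -default
      done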

    Fixing those two things improved performance 100% right there. Moving from the SATA to the FC disks has made an additional huge difference, probably more because there are more FC spindles than SATA.

    • Sean Cummins on February 13, 2009 at 8:12 am

    Just FYI, the requirement that all LUNs within a SATA RAID Group must be owned by the same SP applied to the first generation of SATA drives in DAE-ATA trays; the SATA II drives in UltraPoint DAEs do not have this limitation anymore. UltraPoint was introduced in the CX3 generation, so with CX3 and newer you can balance LUNs across SPs in SATA RAID Groups with no adverse effects. I’m guessing you were working with an older CX in this case.

    1. Nope – CX3-40f….

      Interesting – I wasn’t able to find anything in Powerlink that suggested the requirement had been removed.

      I also can’t argue with the 100+% performance improvement we saw after I dedicated the I/O within a RAID group. (Backup time went from 2 days to 18 hours.)

      I seriously doubt that could have all been the read cache; 384 MB isn’t enough to make that much of a difference.

    • Sean Cummins on February 13, 2009 at 3:46 pm

    Enabling read cache also enables the CX’s ability to prefetch, and if your backups are generating large-block sequential reads, then prefetch can result in that degree of performance improvement.

    As for the SATA II / UltraPoint thing: I couldn’t find any docs where this is glaringly obvious, but if you look at the Clariion Best Practices for Fibre Channel PDFs on Powerlink and search for “DAE-ATA” within the doc, you’ll see a brief note recommending that LUNs in DAE-ATA RAID Groups be kept owned by a single SP. The omission of UltraPoint DAEs from this note implies that the restriction is only applicable to DAE-ATA trays (which is true).

    • Joe on February 24, 2009 at 8:39 am

    I am doing a migration of all CIFS shares from an old integrated Celerra NS50x to a newer Celerra NS120.
    Do you recommend the use of CDMS for this? In my case the new NS120 will have different IP addresses than the source NS. Obviously you used the same IP addresses on the same Celerra. Do you see any issue with me using CDMS in my environment? (Sorry for any spelling errors.)

    1. Hey Joe – haven’t heard from you in a while.

      Actually there are two ways of doing it, and it mostly depends on when you can take the outage. If you are changing IPs anyway, you might be better off with Celerra Replicator: snap the filesystem, replicate the snap to the new box, and when you’ve got most of the data over, take the host down and do a final snap-and-push.

      The least disruptive would be CDMS, but there are a few gotchas. Small files work wonderfully; when you start looking at doing LARGE files you might run into timeouts. Since the copy has to complete before the file is presented to the host, a request for access to a large file (gigs or more) can take a while: it’s the time it takes to copy the file from the old NS to the new NS before it can be presented out. (Had that one bite me using CDMS to migrate a share that was being used as a Backup Exec dump area.)

      Also – big also. When you create a CDMS migration it will default to UDP, which sucks wind. If you do it via the command line without the proto=TCP option, it will also default to UDP. Make sure you specify TCP, because your connect speeds and migration times are MUCH better.

      Give me a call if you need anything, you’ve got the number.
