May 02

Still bored…

The painful part is I have a lot to do, but it’s mindless drivel that doesn’t stimulate the geek in me nearly enough.

Interesting thing from this week though – I found that I tend to over-engineer Clariion designs.  I guess such a large part of me wishes I were still working on the Symmetrix/SRDF configs of my past that, when presented with what turned out to be a simple Clariion config, I spread the raid-groups a little too efficiently, which, according to the second set of eyes I had working with me, can actually cause more problems than it fixes.

Silly me – I assumed that spreading the IO across both spindles and DAEs was a good thing.  (It used to be, but apparently not so much now.)

Someone give me a DMX to play with. 🙂

18 comments


  1. You were trying to build raid groups across DAEs right? (because who wouldn’t want to be able to survive the loss of an enclosure)

    But the configuration he suggested was more in line with this (just using 7 disk R5 sets as an example)?

    DAE-0 7xRAID5 7xRAID5 1HS
    DAE-1 7xRAID5 7xRAID5 1HS
    DAE-2 7xRAID5 7xRAID5 1HS
    DAE-3 7xRAID5 7xRAID5 1HS

    With MetaLUNs striping across DAEs.

    If so I ran into that the first CLARiiON I did back in ’04. If not, do tell what you proposed vs. what was “correct” so others do it right… 🙂

  2. Jesse

    Ideally that’s what I thought. In a perfect world (and I’ve told people this) you would build a raid-group vertically. With 5 DAEs you could use drive 0 in each DAE and go vertical.

    Do you know the performance impact of 6+1 vs. 4+1? Seems to me there would be a write-hit due to the extra parity calculation involved.

    My usual config is

    DAE0 (4+1)(4+1)(4+1)
    DAE1 (4+1)(4+1)(2+2)(HS)

    (The 2+2 is Raid 1/0)

    This gets me decent-performing RAID groups plus a RAID-10 set for transaction logs and the like.
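
A quick sanity check of the two-DAE layout above, as a minimal Python sketch. The 15-slot DAE size and the tuple format are assumptions made for illustration; the group sizes come from the comment.

```python
# Hypothetical model of the two-DAE layout described in the comment above.
# SLOTS_PER_DAE = 15 is an assumption, not something the comment states.
SLOTS_PER_DAE = 15

# Each group is (label, data_disks, redundancy_disks).
layout = {
    "DAE0": [("R5", 4, 1), ("R5", 4, 1), ("R5", 4, 1)],
    "DAE1": [("R5", 4, 1), ("R5", 4, 1), ("R10", 2, 2), ("HS", 0, 1)],
}

def slots_used(groups):
    """Total physical disks consumed by the groups in one DAE."""
    return sum(data + redundancy for _, data, redundancy in groups)

def data_disks(groups):
    """Disks' worth of usable capacity in one DAE."""
    return sum(data for _, data, _ in groups)

for dae, groups in layout.items():
    print(dae, "slots used:", slots_used(groups), "data disks:", data_disks(groups))

total_data = sum(data_disks(groups) for groups in layout.values())
print("total data-bearing disks:", total_data)
```

Under those assumptions both enclosures come out exactly full, with 22 of the 30 disks carrying data.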

  3. I’ve never seen a direct comparison of 4+1 vs 6+1 for CLARiiONs, but the usual recommendations are:

    – RAID1 with each drive in separate DAEs
    – RAID10 with 4, 6 or 8 disks (where 4 is best performance) in the same DAE
    – RAID5 with 3, 5, or 7 disks (where 5 is the best balance of performance vs. space) in the same DAE

    Then for R5 & R10 use striped MetaLUNs to span raid groups, and throw in a hot spare for every 2 enclosures.

    What you had proposed is perfect (assuming you’re using MetaLUNs to span the raid groups). Although on a large and busy CLARiiON array I’ve seen disks 0-4 (FLARE) slow things up a bit, and have seen people exclude those and put a single RAID-5 set on them that they barely use.

    Then again I’ve seen people who’ve created a single RAID-5 group that spanned two DAEs and was 15+1 (16x300GB disks!) – performance was “limp” and it was a disaster waiting to happen. Can you imagine how long the rebuild would take… and the parity calculations, yikes, the SP’s “head” must have been swimming. Suffice it to say, switching it to 3x 4+1 improved I/O performance even with the loss of a physical spindle.

    I assume this is a non-issue on a Symm? Now for the question I just have to ask, can LUNs span cabinets?
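
To put rough numbers on the 15+1 versus 3x 4+1 comparison above, here is a small Python sketch. It only counts disks; the function name is made up for illustration, and the rebuild figure assumes a RAID-5 rebuild reads every surviving member of the group, which is standard for single-parity RAID but says nothing about actual rebuild times on a CLARiiON.

```python
DISK_GB = 300  # disk size mentioned in the comment

def raid5_summary(data_disks, groups=1):
    """Return (total_disks, usable_gb, parity_fraction, rebuild_reads).

    rebuild_reads is the number of surviving disks that must be read
    to rebuild one failed member of a single group.
    """
    width = data_disks + 1            # data disks plus one parity disk
    total = width * groups
    usable_gb = data_disks * groups * DISK_GB
    parity_fraction = groups / total  # share of disks spent on parity
    rebuild_reads = width - 1
    return total, usable_gb, parity_fraction, rebuild_reads

wide = raid5_summary(15, groups=1)    # one 15+1 group
narrow = raid5_summary(4, groups=3)   # three 4+1 groups

print("15+1   :", wide)
print("3x 4+1 :", narrow)
```

The wide group gives more usable space from parity's point of view, but a single failure drags 15 disks into the rebuild instead of 4.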

  4. Jesse

    When I was working Loan-To-Learn, I think I had done the SATA drives that I used for the veritas dumparea as 6+1 in each, but didn’t use Metavolumes (didn’t know they existed at the time.)

    Yes – I’ve definitely heard of instances where putting data on the 0-4 vault drives has caused performance issues. It amazes me that EMC, knowing this is a problem, still utilizes those as such and doesn’t do something like populating those with 18G disks that aren’t intended for actual storage. Then we as engineers wouldn’t be put into the position of having to explain to the customer that they bought all this storage (it’s especially painful when the customer has 300G drives in those slots) but if they use 5 of the disks they run the very real risk of markedly degrading performance.

    You’ll notice that the layout I put above actually results in 5x 4+1 raid-5 groups. This is perfect because you can use RG0 (0_0_0 –> 0_0_4) for static storage or otherwise light-use, and you have 4 remaining 4+1 sets to use for metaluns, plus the 2+2 set for logs.

    Do you think there is a performance benefit to going with metaluns across all four raid-groups (spanning two DAE’s) or should you stick to a single meta lun spanning two raid groups, and either keeping them inside a DAE or spanning?

    As to your last question – in the DMX1000-3000 (DMX1 and DMX2) series it was never an issue. The hypervolumes (luns) were broken up according to the internal algorithm, guaranteeing that both members of a Raid-1 mirror were as far as possible from each other (physically and logically), and ensuring that the back-end rules were maintained (i.e. no single point of failure).

    When you create a metavolume on a symm, simply using consecutive hypervolumes guarantees the widest possible distribution across the back-end. And since you have multiple processors, multiple channels, and multiple paths to disk, you have both maximum throughput and reliability.

    The DMX3 and DMX4 throw a few new wrinkles into it. They are very “clariion-like” in their back-end, as they use something very similar to the DAEs the clariion uses (only mounted vertically); however, that is largely where the similarities to the Clariion end. There are still numerous redundant paths to the storage, multiple processors, and a whole bucketload of cache to deal with, which is why, from the host standpoint, the symm will always outperform the Clariion in day-to-day operation.

    (I’ve seen lab-environment tests that got a Clariion to out-perform a Symm on certain kinds of IO, but the IO profile that was used was something you’d never see in real life.)

  5. I would say performance *could* benefit from spanning the two DAEs, but there are some important factors. If you wanted to use all 4 raid groups in your example with a single host you could decrease performance (assuming you weren’t doing it for disk space reasons but for performance) because a MetaLUN is only accessed through the SP that the base LUN is currently trespassed to.

    If you had created two MetaLUNs and assigned each base LUN to a different SP you would prevent that SP from becoming a bottleneck on the array. Also, since the usual cabling of DAEs is to alternate across back-end loops (and assuming you’re not using a CX3-10 or -20, which don’t have two loops per SP), the most you would ever benefit from is spanning 2 DAEs.

    Sometimes it can make more sense to keep the MetaLUNs within the enclosure and present the two MetaLUNs (one on each DAE) to the host and then have the host do the striped volume management itself. Or I guess you could go nuts and double everything to 4 DAEs, configure MetaLUNs within each DAE, present half through each SP and have the host combine all 4… but that would be somewhat confusing for most humans (and a bitch to set up).

    I guess it would all depend on the situation. 🙂

  6. Jesse

    Not quite, because only those luns have to be owned by a certain SP.

    So the following scenario could be true:

    ML-01 (consisting of)
    LUN0 (RG0)
    LUN1 (RG1)
    LUN2 (RG2)
    LUN3 (RG3)

    Could be owned by SPA

    ML02 (consisting of)
    LUN4(RG0)
    LUN5(RG1)
    LUN6(RG2)
    LUN7(RG3)

    Could be owned by SPB.

    Of course this only applies to FC disks, because they are dual-ported. ATA disks are single-ported so you run into a situation where all of the physical disks in a raid-group must be owned by the same SP, and all luns on that raid-group must also be owned by the same SP, otherwise performance sucks.
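
The ownership scenario above can be written down as a small Python sketch. The dictionary layout and helper names are hypothetical and purely for illustration; the LUN, RAID group and SP assignments are the ones listed in the comment.

```python
from collections import defaultdict

# MetaLUN -> (owning SP, [(component LUN, RAID group), ...])
metaluns = {
    "ML-01": ("SPA", [("LUN0", "RG0"), ("LUN1", "RG1"), ("LUN2", "RG2"), ("LUN3", "RG3")]),
    "ML-02": ("SPB", [("LUN4", "RG0"), ("LUN5", "RG1"), ("LUN6", "RG2"), ("LUN7", "RG3")]),
}

def owners_per_raid_group(metaluns):
    """Map each RAID group to the set of SPs that own LUNs carved from it."""
    owners = defaultdict(set)
    for sp, components in metaluns.values():
        for _lun, raid_group in components:
            owners[raid_group].add(sp)
    return dict(owners)

def mixed_ownership_groups(metaluns):
    """RAID groups whose LUNs are split across more than one SP."""
    owners = owners_per_raid_group(metaluns)
    return sorted(rg for rg, sps in owners.items() if len(sps) > 1)

print(mixed_ownership_groups(metaluns))
```

Going by the comment, a non-empty list is fine on dual-ported FC disks (both SPs share the load on every group), while on ATA you would want this list to be empty.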

  7. Thanks for the DMX details. I had seen how the newer ones were using DAE-like modules and was curious what the consequences were. Yes 64GB of cache (or more) is always nice.

    As for the FLARE drives in the CLARIiONs, I don’t see why they can’t just throw a pair of solid state disks in each of the SPs and be done with it. Heck it would save 5 slots, improve the response times and you wouldn’t have to worry about sacrificing the vault cache for more disks on small systems.

  8. You’re absolutely right, I was over-thinking things (it’s contagious).

    Do it that way and then have the host stripe across the MetaLUNs with its volume management software. All that for what will probably amount to a minor increase in throughput and I/Os… but at least you’d be fully utilizing the array. 🙂

  9. Jesse

    Actually it’s funny you mention the SSD disk on a Clariion. They used to use a 2.5″ laptop drive on the Clariion SP. (I believe the 4700 and previous) and that’s where all of the clariion code was stored. The FC5300 was the first one to use the “Vault” drive (although it was only the first drive in the array at the time) but I think that was only because the raid controller on the 5300 was a low-profile card that fit into the slot usually occupied by an LCC card.

    The problem with over-engineering the Clariion, is that the busses aren’t the bottlenecks on a Clariion. When it comes down to it all communications are serial because there is still only one internal processor (regardless of the number of cores that processor might have) and bus. That means that all IO, by design, is serial.

    Now I won’t even get into the fact that it’s running Windows Embedded…..

  10. Jesse

    Oh – and the last thing going back to the internal disks on the Clariion would do—it would make RaidGroup0=RaidGroup1 in size…

    The most annoying thing is to go about creating a MetaLun and find you’re short on space in RG0 due to the vault drives.

  11. Don’t we all wish someone would give us nice shiny toys to play with 🙂

    Myself, I’d like both a DMX and a VI3 setup to go with it…

  12. Oddly enough, I have a DMX that I’m running a rather large VI3 buildout on….It is a nice toy–But stressful. You guys know what I mean I’m sure.

  13. Jesse

    I have enough trouble keeping my 4 Dell 2650’s cool in my little 20×40 office/computer room. I think throwing a DMX into the mix would be fatal. (not to mention my wife would divorce me.) 🙂

  14. Interesting post and comments. I come from the NetApp world where 16 drive raid groups are recommended (RAID 6 though). So I’m curious why there would be a performance hit from having a large raid group such as a 15+1 vs. four 4+1 in terms of I/O performance as stated by Andrew above.

    I understand about the performance degradation during rebuild as well as the risk of another drive failure during the process.

    BTW I now look after a Clariion and a DMX and the clariion seems to be easier to grasp and understand. The DMX is a black box so far outside of the SMC 🙁

  15. Jesse

    Because – and this is true with NetApp as well – when you stripe that wide (14+1 or 13+2 in a 15-drive chassis), every write has to be preceded by x-1 reads in order to compute the parity. Now granted, most arrays do this parity calculation in cache long before it hits the disks, and systems with a real predictive read-ahead, like the Symm, will just pull the calculations from cache instead of having to go back to disk for data that is already there.

    15 drive raid groups do offer the benefit of screaming-fast reads….Since the storage array (provided it has the CPU cycles and programming to do it) can, in the case of long, sequential, reads, pre-position more drives to allow the controller to stream data off the disks. It’s *WONDERFUL* for archival data and other data that is read 80% more often than it is written to.
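
The parity cost above can be illustrated with a toy XOR example in Python. This is a sketch of two textbook ways to update single parity, not a description of what FLARE or any particular array actually does; which strategy a given controller picks, and when, is an assumption here. All function names are made up for illustration.

```python
from functools import reduce

def xor_parity(blocks):
    """Parity of a stripe: byte-wise XOR of all data blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def reconstruct_write(stripe, index, new_block):
    """Recompute parity from the whole stripe.

    Needs the other len(stripe) - 1 data blocks, so the read count
    grows with the width of the RAID group.
    """
    others = [block for i, block in enumerate(stripe) if i != index]
    return xor_parity(others + [new_block]), len(others)

def read_modify_write(old_block, old_parity, new_block):
    """Update parity from the old data and old parity only (2 reads)."""
    new_parity = bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_block, new_block))
    return new_parity, 2

stripe = [bytes([i] * 4) for i in range(1, 15)]  # 14 data blocks, as in a 14+1 group
parity = xor_parity(stripe)
new_block = b"\xff" * 4

p1, reads1 = reconstruct_write(stripe, 0, new_block)
p2, reads2 = read_modify_write(stripe[0], parity, new_block)
print(p1 == p2, reads1, reads2)
```

Both paths produce the same parity; the difference is how many blocks have to be available (on disk or in cache) to get there, which is where stripe width and cache size come in.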

  16. Ah, ok that’s what I thought you guys were referring to. That penalty is there for any type of parity RAID. While this is usually bad in software-based RAID, as you pointed out, with the Clariions, Symms and NetApps most of the writes happen in the cache, then they’re coalesced and laid down on the disks in even stripes as much as possible.

    This is also where NetApp is a bit different and more efficient. Since it owns the filesystem too it handles things a bit better. It keeps a bitmap of the blocks in its memory and, unless the filesystem is heavily fragmented, it’ll try to lay down full stripes as much as possible. Even for re-writes it doesn’t modify the existing blocks: it writes a new stripe on the free blocks, then adds the old blocks to its free-block list. Therefore it’s important to have 10-15% free space on the filesystem for the NetApps. However, as they get full it loses this advantage.
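
The "never overwrite in place" behaviour described above can be sketched as a toy Python model. This is not how NetApp implements it; the class and its fields are invented for illustration and only show why rewrites consume free blocks before releasing old ones.

```python
class WriteAnywhereToy:
    """Toy model: rewrites go to a free block; the old block is freed afterwards."""

    def __init__(self, total_blocks):
        self.free = list(range(total_blocks))  # free-block list
        self.block_map = {}                    # logical block -> physical block

    def write(self, logical):
        """Place a (re)write of a logical block on the next free physical block."""
        if not self.free:
            raise RuntimeError("no free blocks left")
        new_physical = self.free.pop(0)
        old_physical = self.block_map.get(logical)
        self.block_map[logical] = new_physical
        if old_physical is not None:
            self.free.append(old_physical)     # old copy becomes reusable
        return new_physical

fs = WriteAnywhereToy(total_blocks=8)
first = fs.write(logical=0)
second = fs.write(logical=0)   # rewrite lands on a different physical block
print(first, second, fs.free)
```

Every rewrite needs a free block up front, which fits the comment's point that the advantage fades as the filesystem fills.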

  17. Jesse

    Yes – but unless you have literally TONS of cache (which is how the Symmetrix gets away with having performance numbers almost identical between R5 and R1) your back-end speed directly affects the speed at which the cache is de-staged to disk. And when you’re talking about a system like a CX300, which really does have minimal cache onboard, you will see the performance hit.

  18. No experience with the smaller boxes but on my cx3-40 (clariion cache turned off and running IOZONE) I saw minimal difference. The LUNs I tested were 16 disk Meta (4 x 4+1 R5) and one 8 disk Meta. I figure a Meta Lun is the same as having a large raid group, only difference being the Metas provide added insurance against multiple drive failures.

    With two camps, one that says the more spindles the better the overall performance and the other that says parity-based RAID will suffer from large RAID groups, I expected to see a big difference – though the host buffer cache may have something to do with the skewed results.
