Cisco dial-home follow-up!

Cisco has released version 3.0(2) of their firmware, and with it comes, finally, an EMAIL-HOME feature for the switch.  It’s not perfect, because unlike the dial-home on the Symm it depends on an external server or two, and it doesn’t work if the network connection is down.

The long and the short of it is that Cisco Fabric Manager, which is installed on whatever host you’re going to manage the switch from (as it is with the current versions of Cisco SAN-OS), monitors the switch via SNMP, and when an error condition is detected it sends an email to EMC with the appropriate information.

I want to say it’s about bloody time.  Cisco switches have been the red-headed stepchild of EMC for too long.  It’s about time they supported them with the fervor that they do the rest of their products.

The down-side?  You have to go from version 2.1(x) to 3.0(x), and I’ve not heard back from EMC at this time as to whether this is an online upgrade on anything shy of the MDS 95xx switches.  (I’m using MDS 9216s and an MDS 9140, which I suspect need a full reboot to come up on the new firmware rev.)


EMC and Cisco are some kind of partners.

Ok, there is something I hate about Cisco – and it’s not really about Cisco, it’s about EMC’s complete failure to fully support their MDS series of switches.

First – when you buy an MDS 9xxx switch from Cisco, it comes with the ability to dial in to Cisco when there is a problem, much like EMC does with the Symmetrix.  This feature is not available when you buy the switch from EMC: with a Cisco switch that has a $10,000 sticker on it that says “EMC,” you don’t get this ability.  In fact, you get no dial-home capability at all.

I called in to the SAC, talked to our sales team, our former Project Manager who was (sort of) handling the implementation, etc., and the long and the short of it is that in order to be notified when there is a problem with the switch, you have to set Control Center up to email you, and then you have to call EMC.  (I’ve called them so many times I know the number by heart.)

Secondly (and more to the point of the subject of this email), making a zoning change in ECC requires that you log into the switch and type “copy run start” at the prompt to save the running configuration to the start-up configuration.  I’ve known people to go months without doing this, only to find that every zoning change made over the preceding weeks is lost on the reboot of the switch.  (A quick way to guard against that is sketched below.)
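
If you don’t trust yourself (or your coworkers) to remember, you can script the save from the management host.  This is a hedged sketch: the switch names are made up, and it assumes SSH is enabled on the switches (“copy run start” is just the abbreviated form of the command below).

    #!/bin/ksh
    # After zoning changes in ECC, copy the running config to startup
    # so a reboot doesn't eat them.  Switch names are hypothetical.
    for SWITCH in mds9216-a mds9216-b; do
        ssh admin@$SWITCH "copy running-config startup-config"
    done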

And third – when you do call EMC for support, they have to wait on Cisco to ship the parts.  So technically you’re not even really getting EMC support, you’re getting third-party “ghost” support.

This seems to me to be inexcusable considering the number of MDS switches that EMC is putting on the market.

My advice – if you’re going to buy a Cisco fibrechannel switch, and I do recommend them, buy it from Cisco.  Until EMC decides to support them fully, I don’t think it’s worth the convenience of getting everything from one place given what you come up short on.

Enterprise Vault for Exchange

The boys over at Symantec (www.symantec.com) just came by last week and gave us an interesting presentation on Enterprise Vault.  (Not to be confused with the Vault extension for NetBackup, which is a different beast.)

The short answer is this: EV is an application that dives into your Exchange environment and strips out any email/attachments over (x) days old.  It then creates an HTML-view copy and moves the original out of the Exchange information store into Tier-2 storage.  Then, after an even longer period, say a couple of years or so, you can even stage it from Tier-2 (slow disk, like Clariion ATA) to Tier-3 (tape) storage.

The cool part is that a header stub stays in the user’s email showing that it’s a vaulted message.  If they double-click to open it like they would a normal email, it figures out where the message is; if it’s in Tier-2 storage it brings it up, and if it’s in Tier-3 storage it sends a tape request to the appropriate person so the tape can be recalled from off-site storage and restored.

For companies that have to retain data for 20+ years, how bloated can an email infrastructure get?  I’ve got 90G in my information store after the first year, and it’s only going to get worse from this point on.

Though I’ll bet EMC is frothing at the mouth at the idea. 😉

It’s Alive!

Well, after two days of serious work getting the cabling and power ready, EMC came out and spun up the Symm today.

Hot damn!  Finally some real storage and I can start planning the migration off the stupid Clariion. 😉

One down-side, though: the Multiprotocol cards they shipped were the wrong type.  We have an M2 model DMX, and the RDF cards were apparently shipped for an M1.

I have a question for anyone bored enough to be reading this.

Can anyone see any reason why 62.5-micron fibre is used?  There is no signal-to-noise difference that I can see, and the range is not as long as 50-micron.  Our network guy swears by the 62.5 and I keep telling him he’s on crack.

However, thanks to the fact that he’s the one who designed the infrastructure, the links that go from the new datacenter to the old one are all 62.5, and now I’m waiting on CDW to overnight me six new 10-meter 62.5-micron cables to hook the new SAN into the old SAN so I can get the Clariion migrated.

Always an adventure.  🙂


TimeFinder on MSSQL a possibility?

Let’s face it, if it’s Oracle, DB2, or anything along those lines, I can snap a copy and back it up with my eyes closed.

MSSQL, being a pretend database, has me stumped.  I’m so used to archive logs that I’m not even sure how to use TF/Snap to back up the database.

This is my understanding.  MSSQL doesn’t do “Archive Logging” in the traditional sense.

In a “REAL” database system the process is as follows:

1.  You put the database into “Hot Backup” mode.  In Oracle this checkpoints the data files, freezes their headers, and logs extra change information to the redo logs.  (When you recover from that backup, the logged changes are replayed to make the files consistent.)

2.  When the above is complete, you issue a command in one form or another to switch out the current transaction log, which closes one log file and opens the next, and then you back up the database files along with the closed transaction log files via whatever file-level backup process you have in place, whether it be TimeFinder or just having NetBackup pull the files from that server.  (Something like the sketch below.)
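
For the Oracle flavor of that dance, the whole thing scripts up in a few lines.  A minimal sketch, assuming SYMCLI on the host and a TimeFinder device group I’m calling “proddg” (the group name and SID are mine, not gospel); it uses 10g’s database-level backup mode, where older releases need BEGIN BACKUP per tablespace:

    #!/bin/ksh
    # Hedged sketch: Oracle hot backup mode around a TimeFinder split.
    export ORACLE_SID=PROD        # hypothetical SID

    sqlplus -s "/ as sysdba" <<EOF
    ALTER DATABASE BEGIN BACKUP;
    EOF

    # Split the BCVs while the data files are in backup mode
    symmir -g proddg split -noprompt

    sqlplus -s "/ as sysdba" <<EOF
    ALTER DATABASE END BACKUP;
    ALTER SYSTEM SWITCH LOGFILE;
    EOF

    # The split BCVs can now be mounted on a backup host and swept to tape.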

At Disney we did just that, with DB2, and moved in the neighborhood of 250+ Terabytes to tape every night.

At a number of other sites I’ve done the exact same process with Oracle.

Enter MSSQL, a Playskool excuse for an RDBMS, and I’m stumped.  See – the problem is that there is never more than one “database.LDF” file for logging.  How am I supposed to quiesce writes to a logfile when it never closes it?

Then add to that the fact that the process for rolling transaction log backups forward in MSSQL depends on your having used the MS backup process to back it up.  It seems to be completely unaware of file-level backups of the database.
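
For the record, the only log-chain-aware mechanism it has is its own BACKUP statement.  A hedged sketch with invented database names and paths, driven through the osql client on the Windows host:

    osql -E -Q "BACKUP DATABASE MyDB TO DISK = 'D:\dump\MyDB_full.bak'"
    osql -E -Q "BACKUP LOG MyDB TO DISK = 'D:\dump\MyDB_log.trn'"

Anything that doesn’t flow through BACKUP DATABASE / BACKUP LOG – including a file-level copy of the .MDF and .LDF – is invisible to its point-in-time restore logic.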

I’m at a loss here – any ideas?

A little down-time?

Just figured out that the site was down for about a day or so this weekend.  Apparently, during a remodel of the office/data center, I disconnected the router and never even noticed that the site was down, because between putting the drywall up and trying to sleep off the injuries to my aching muscles, I hadn’t checked the site.

I think I need to get What’s-Up on this network to do some monitoring.  What do you think?

Yes, I’m geek enough to have a rack in my basement, no SAN though.  I used to have a Clariion 5300 that I picked up on eBay, but the power and cooling bills were killing me.

So I opted instead for a Dell 2650 (2 CPUs + 2GB) with five 146GB drives for the VMware ESX server (running three virtual systems: one Windows, two Linux) that this site runs on.

I also have a generic dual P3-850 for my backup server.  The backup server has a half-terabyte (2x 250GB) of storage for disk-disk-tape backup.  (Right now it’s disk-disk, because my library is offline until I can get some tapes bulk-erased.)

I’ve noticed that most of the blog sites are run off hosted WordPress sites – am I the only one nuts enough to take this particular project on himself?

RPO vs. RTO

An engineer friend of mine (a real engineer, not affiliated with computers) once told me:

 “There are three options:

     1. You can have it faster.
     2. You can have it smaller.
     3. You can have it cheaper.

….Now pick any two.”

Over and over in my life I’ve put that theory to the test, and to this day it has always held true.  The smaller and faster something is, the more expensive it gets.  The cheaper something is, the slower and less portable it is.

Disaster Recovery and COOP (Continuity of Operations – for the layman) follow a lot of the same rules.

There are three main criteria you’re aiming for.  The main two are RPO and RTO – that’s “Recovery Point Objective” and “Recovery Time Objective.”

The third is, of course, cost.

RPO is the point in time you need to be able to recover to.  Some goals are easy to obtain: “midnight on the morning of the failure” is usually pretty easily achievable, as you can get there by restoring from backups.  Financial institutions aim for somewhat stricter objectives.  Most banks will require an RPO of zero, meaning “I want to see the last committed transaction on my DR site in the event the source site becomes a smoking hole in the ground.”

This is doable, of course, provided the DR site is close enough to the source site to run dark fibre between the two with low enough latency to have a negligible impact on production.  (Light in fibre takes roughly 5 microseconds per kilometer one way, so every 10 kilometers adds about 0.1ms of round-trip time, and a synchronous write needs a couple of round trips.  A normal physical drive has a latency of about 9-14ms, so if you go too far you’re going to slow your system to a crawl.)
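
To put rough numbers on it – and the constants here are my assumptions, not gospel: about 5 microseconds per kilometer one way in glass, and two round trips per synchronous write:

    #!/bin/ksh
    # Back-of-the-envelope added latency for synchronous replication.
    KM=${1:-50}     # distance in kilometers (default 50)
    # us/km one-way * 2 (round trip) * 2 (round trips per write), in ms
    echo "scale=2; $KM * 5 * 2 * 2 / 1000" | bc

At 50km that works out to about 1ms of added write latency; at 500km it’s about 10ms, on top of the 9-14ms the physical drive already costs you – which is why nobody runs synchronous at that distance.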

RTO is defined as “how long can I afford to have my environment down to effect a failover?”  I’ve worked in one environment where transaction logs were backed up to tape and shipped across the country from the L.A. area to Orlando, Florida, where the tapes were then restored into a standby system.  The recovery time to a 15-minute increment was effectively days, because they actually had to wait for the last tape to make it to the target site before they could restore it and bring the system on-line.  It was insane.

Your goal is to get RPO and RTO to as close to zero as possible without bankrupting the budget (or the company).

An RPO of zero can be obtained with a DR site within about 10 kilometers, or 20 if you can live with slower response times in production.  This is full synchronous transfer from one array to another: every write from host to disk has to be acknowledged by the REMOTE array before the host is told the write is committed.

EMC’s SRDF/A and SRDF/AR mitigate that in environments where the DR site is far enough away to kill any chance of SRDF/Synchronous working.

SRDF/A is a “packetized” SRDF, where the receiving Symm has to receive two consecutive “checkpoints” before it commits the block of data.  That way, if an incomplete block is received, it’s discarded, to prevent data corruption resulting from incomplete write information.  The downside to SRDF/A, of course, is that it requires an insane amount of cache to function properly.  (And don’t let an SE tell you it doesn’t; he’s either lying or not capable of understanding that for the remote Symm to receive a block of data, it has to be able to store it somewhere other than disk until it receives two checkpoints.)

SRDF/AR is an automated replication product.  You are essentially mirroring production to a TimeFinder BCV, which is then sent synchronously to the remote site.  You can run a sync transfer because the BCVs are not connected to the production volumes, and as such the production volumes do not require any ACK/NAK from the remote system.  Depending on the time it takes to replicate (how fast the pipe is between the two sites), you can get RPO to about 10 minutes, which is good enough for most.  The effects of SRDF/AR can be duplicated by anyone proficient in Korn shell, as it literally runs a series of waits and whiles for each stage of the process (see the sketch below).  AR has the added bonus that you can actually keep a second set of BCVs on the target host and run your backups from them.  The down-side to the AR type of scenario (whether it be SRDF/AR or a scripted set-up) is that it costs disks – and lots of them: the production volumes, mirrored; the first set of BCVs, unprotected; the SRDF target devices (mirrored or RAID 5); and the second set of BCVs.
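
Since I brought up the Korn shell angle, the skeleton of such a loop might look like this.  Treat it as SYMCLI-flavored pseudocode – the device group name is invented, and the exact verify flags vary by Solutions Enabler version:

    #!/bin/ksh
    # Hand-rolled SRDF/AR-style cycle (device group name hypothetical).
    DG=proddg

    while true; do
        # Re-attach and resync the BCVs with production
        symmir -g $DG establish -noprompt
        until symmir -g $DG verify -synched; do sleep 60; done

        # Split off a consistent point-in-time copy
        symmir -g $DG split -noprompt

        # Push the split copy to the remote side and wait for it to land
        symrdf -g $DG establish -noprompt
        until symrdf -g $DG verify -synchronized; do sleep 60; done
    done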

Scary huh?

As I prepare to start my own replication design, this was foremost on my mind, which is how it ended up here.  (This is, after all, the dumping ground for my random thoughts.)

It’s Layoff time again – everyone say “Thanks EMC”

You know, corporations (like EMC) must think we’re stupid.

I read in eWeek ( Link ) that EMC is going to kick 1,250 people out on their asses right before the holidays because they missed their numbers last quarter.

I guess people are just so much chattel, to be sold off whenever profits need a boost.  What most people (investors) don’t know is that kicking 1,250 people to the curb does absolutely NOTHING to bolster their bottom line.

They won’t lay off sales people; they’re the “bread and butter” of the organization, right?  And obviously it won’t be the engineers; they’re busy designing the new product.

That leaves the lowly professional services people.

I’ve been working with and for EMC (directly and indirectly) for close to seven years now, and I can tell you one thing: when they lay off PS people, they will, with 100% certainty, have to turn to consultants and partners to get the promised work done.  Partner companies cost about 3-5 times as much as internal employees (probably a bit less when you take benefits into consideration).

Eventually – and I can now count four times this has happened – they will realize how much money they are spending on partners and consultants and declare a moratorium on partner utilization.  This requires them to re-hire the people they laid off, usually at higher salaries (because those same people went to work for the partners when they got laid off), and usually they only get the lower-quality employees back.

(Sorry EMC, but the partners pay better than you do, by a good sized margin.)

They tried to entice people to hire in – in my case, specifically, by telling me what a stable place EMC was and how I would be insulated from the ups and downs of the market.

Apparently not – 1,250 people are going to find out how insulated they really are.

Right before Christmas….

Oh – and by the way: some of your customers (me included) take note of how you treat people.  I know I’m shopping around quotes right now that I might otherwise have gone straight to EMC for.


NAS Management

EMC Control Center has more management tools for Network Appliance filers than it does for EMC’s own hardware.

I wonder if this means they prefer the NetApp devices?

Is Microsoft VSS a real Snap? Maybe. Does it suck? Absolutely!

I can’t even talk during the day because of the great sucking sound coming from our Microsoft infrastructure.

From a storage end it’s even harder, because natively Microsoft doesn’t have ANY tools to unmount a filesystem or quiesce a production volume so you can take a hardware-based snapshot of it.

Of course they’ve introduced VSS, which is like saying that there is never any way but their way to clone a volume.

The main problem with VSS (besides it being a product of the limited minds at Microsoft) is that it’s yet another stupid host-based application that requires system resources on the host when engaged.

VSS, and most other volume “Snapshot” providers, work in the same way.  The simplistic description is “Copy on first write.”

Let’s go over it step-by-step.

Continue reading