Troubleshooting? Try actually figuring out the problem instead of guessing.

I’ve seen this time and time again, both with people I work with and with support personnel.

Figure out the damned problem, don’t throw rocks at it hoping you’ll get a hit.  When I was in tech school my instructor called this “shotgun troubleshooting” and said it’s a sure sign of an inept technical support person.

When I worked for America Online, our standard troubleshooting tool was “D&R” – “Delete and Reinstall.”  I got poor reviews for taking the time to actually fix problems instead of shuffling people off the phone as quickly as I could.  (Keep in mind, I supported versions 2.5 and 3.0 of the AOL software, so it’s been a while.)  Working for Microsoft it wasn’t as common, but it was still used more often than not in place of actual troubleshooting.

It’s very easy to solve a problem by dropping a bomb on the software, but it’s a bad way of doing business.  When a support person tells me to D&R the software, my first response is “WHY?” and my second is “Get me someone who knows what the hell they’re doing.”  (Especially in the case of Microsoft, who has the nerve to charge me for this bad advice.)

My main complaint about D&R is that it eliminates any chance of figuring out what the problem was, which in turn means you can’t take precautions to keep the same errors from recurring.

Exchange backups a problem?

74 Gigs should *NOT* take 24 hours to back up.  Keep in mind, we are not going to tape; we are backing up to Disk Storage Units and then copying the backup images from disk to tape later.  So tape bandwidth is not the issue here.

I’ve been working on an Exchange backup problem.  Now, I know the Exchange server in question was not set up according to best practices: a single information store (it used to live on the C: drive – we finally moved that) for about 350 users.  The new Exchange server is coming online soon (not soon enough for my tastes), but for now this is what I have to work with.

A single-stream backup has been taking about 24 hours to complete, even for differentials, using the default directive:

Microsoft Exchange Mailboxes:\

This creates a single backup stream for all mailboxes.

So knowing there has to be a better way to do this, I tried the usual wildcard, as follows:

Microsoft Exchange Mailboxes:\*

With disastrous results.  The system spawned 400+ backup streams, which held the entire backup environment hostage.  Half the backup jobs couldn’t run within the five-hour window we had set for ourselves.  A little research through the Symantec/Veritas site (their site is not exactly easy to sift through) turns up the following set of directives in the “Exchange Administrator’s Guide”:

NEW_STREAM
Microsoft Exchange Public Folders:\
NEW_STREAM
Microsoft Information Store:\
NEW_STREAM
Microsoft Exchange Mailboxes:\[a-e]*
NEW_STREAM
Microsoft Exchange Mailboxes:\[f-j]*
NEW_STREAM
Microsoft Exchange Mailboxes:\[k-o]*
NEW_STREAM
Microsoft Exchange Mailboxes:\[p-t]*
NEW_STREAM
Microsoft Exchange Mailboxes:\[u-z]*

Now the first two are easy – back up the public folders, and back up the information store as a whole.  Backing up the Information Store as well as the mailboxes is a bit redundant, but it matters if you ever have to do a full restore – restoring from the Information Store backup is much, much faster than restoring mailbox by mailbox.  It feels like wasted time and storage until you’re down and need it, at which point it’s worth every bit of both.
Each of the remaining directives groups a collection of mailboxes into a single stream.  In our case, all mailboxes starting with A through E go into the first stream, F through J into the second, and so on.

There is further tuning that can be done, moving sets of mailboxes from one stream to another to balance them out.
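
For example, if the A–E bucket turns out to hold far more (or far larger) mailboxes than the others, you could split it and widen a lighter range to compensate.  The letter breaks below are purely hypothetical – you’d pick them from your own mailbox counts and sizes:

NEW_STREAM
Microsoft Exchange Mailboxes:\[a-c]*
NEW_STREAM
Microsoft Exchange Mailboxes:\[d-e]*
NEW_STREAM
Microsoft Exchange Mailboxes:\[f-j]*
NEW_STREAM
Microsoft Exchange Mailboxes:\[k-o]*
NEW_STREAM
Microsoft Exchange Mailboxes:\[p-z]*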

Only time will tell if this really helps.  My testing shows that with 5 streams, the per-stream rate drops from 500-600 KB/s to 300-400 KB/s, but multiplied across 5 streams it still looks like a real improvement.
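
Back-of-the-envelope, and assuming the five streams don’t end up fighting over the same spindles: five streams at roughly 350 KB/s each is about 1.7 MB/s aggregate, or about three times the single-stream rate.  If that factor holds, the 24-hour backup should land somewhere in the neighborhood of 8 hours.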

Switch preferences – Cisco vs. Brocade vs. McData

It’s like religion with some people.  Cisco vs. Brocade.

As someone who is no longer affiliated with any storage manufacturer or vendor, I can finally voice a real opinion.

As a consultant, I’ve installed more than 100 switches of every size and flavor.  The most disastrous install was a job in Norfolk, VA, where a DS-12000B (Brocade) “director” had to be swapped out at the last minute because of a defective backplane.  (Every time we plugged a blade into it, the blade would fail – permanently.)

Given the choice, I will always prefer to go with McData for just plain old brute-force reliability.  I know of at least a dozen of the old ED-1032 switches that are still functioning in production environments.  They last forever, are easy to manage, integrate seamlessly into most of the EMC packages, and are great bang for the buck.

That being said, some of my best recent experiences have been with the new Cisco MDS series switches.  Now they take some getting used to, but once you set them up they require minimal management.

Cisco – with VSAN support and its internal routing capability (FCIP is a real possibility with the Cisco) – is hands down one of the best switches on the market.  Using VSANs you can logically carve up a switch into multiple virtual switches, guaranteeing no cross-talk between certain ports (I believe each VSAN also gets its own fibre name server).  Of course, if you’re zoning properly, with only a single initiator and a single target in every zone, you don’t have to worry about this…

The downside of Cisco, of course, is that to use the command line you have to be quite a bit more knowledgeable about Cisco IOS, as the fabric OS follows most of the same command syntax.  (i.e. – to disable a port you type “shut”; to enable it you type “no shut”.)
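
To give a flavor of it, here is roughly what carving out a VSAN, assigning a port, and building a single-initiator/single-target zone looks like from the MDS command line.  This is a from-memory sketch – the VSAN number, port, WWPNs, and names are all made up, and the exact syntax varies a bit between SAN-OS releases:

conf t
vsan database
  vsan 10 name ENGINEERING
  vsan 10 interface fc1/1
interface fc1/1
  no shut
zone name host1_array1 vsan 10
  member pwwn 21:00:00:e0:8b:05:05:04
  member pwwn 50:06:04:82:bf:d0:54:01
zoneset name fabricA_zs vsan 10
  member host1_array1
zoneset activate name fabricA_zs vsan 10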

The GUI is fairly intuitive on both, though Cisco requires a *LOT* of Java and the management server has to be installed and run on the system you’re managing it from (unlike the Brocade, which can be managed from any host with a basic Java version installed).

The Cisco also seems to have the best interoperability, working well with most other vendors’ hardware.  I’ve yet to see something that won’t plug into a Cisco switch, including a Brocade switch. 🙂

If you’re not planning on doing anything fancy and you just need a good, reliable switch, I’ve got to say go with McData.  But if you want VSAN capability, FCIP, and a switch that can route IP like the best routers in the world, buy a Cisco.

If you’re going to buy a Cisco switch, my only suggestion is that you buy it from Cisco, and not from EMC.  Cisco dial-home support is built into the switch, but currently EMC has no method of remotely monitoring Cisco switches.  (They suggest using ECC, which works, but that puts someone between the switch and the EMC Software Assistance Center (SAC).)  So if a switch or component fails, it’s up to you to both notice it and call it in before the clock even starts on the repair (the four-hour response window starts from the point when EMC is first notified of the issue).

Of course EMC’s sales force might disagree.

About the system:

Ok, I know the question is coming, so here it is:

This system is running on a Dell PowerEdge 2650 with 2x 2 GHz processors and 2 GB of RAM, running VMware with 4 virtual machines, only one of which has anything to do with this site.

The VM in question runs Red Hat, the blog software is WordPress (www.wordpress.com), and the database backend is MySQL.

My internet connection is a 3 Mbit DSL connection for now.  As soon as the need justifies it, I’ll put fibre in.

The system purrs quietly in my basement, driving the temperature and the electric bill up. 🙂

Jesse

 

RAID as backup? Only if you have your resume spell-checked.

So my cousin runs a small ISP in the Phoenix, Arizona area.  Nothing special: a few hundred DSL and dial-up users, webmail, etc.

A couple of weeks ago I was down visiting, and he was wringing his hands over a technical “glitch.”  Apparently a drive in his RAID set had failed, and due to either a controller bug (rare, but possible) or user error (much more common), the blank disk started rebuilding over the parity information.

He asked my advice.  I told him it was easy: let the RAID set finish rebuilding, restore from your most recent full backup, then lay whatever incremental backups you have over that to bring it as close to the point of failure (POF) as humanly possible.  It’s not a sophisticated enough system to expect any kind of transaction logging.

Apparently they weren’t doing backups at all.  The feeling was that since the disks were protected in a RAID configuration, backups were a uselessly redundant exercise.

Let me explain why this is a bad idea: garbage in, garbage out.  RAID, whether it’s RAID 1 or RAID 5, only gets you uptime, not data protection, because a corruption will replicate and spread from disk to disk before the user is even aware there is a problem.

The same holds true for replication.  If you’re replicating a hard drive offsite, a database corruption will replicate right along with the production data.  The only exception is transactional replication (such as Quest Software’s old “NetBase” product, which detects an invalid change and halts replication before sending the change to the target system).

So how do you implement a backup?  It’s easy.  Follow the 2-of-3 rule (faster, cheaper, or smaller – pick any two) and you will have a backup solution you can live with.

Or keep your resume spell-checked, because it’s not a matter of if, it’s a matter of when.

Hello storage world!

Jesse Gilleland – 2001

My first post – I always thought that when I finally broke down and started ranting and raving about something, it would be about politics, or the travel industry, or even my devastating disappointment in the lack of quality network television.

No – I decided to write about work.

First, a little about me.  I’ve been in the storage industry directly for about seven years, and indirectly for about 10.  I started out as a customer working for Intuit, Inc. (www.intuit.com) – you know, the guys that make TurboTax, Quicken, and other money-tracking software.  We installed our first Symmetrix in 1997, and I was sold.  We got a 30-50% performance improvement on our old HP 3000 MPE systems almost immediately.

I liked it so much that when EMC offered me an R&D position at their Hopkinton, Massachusetts headquarters, I jumped all over it.  I spent almost two years in direct immersion, learning everything from basic disk-storage principles, to advanced replication software, to debugging Fibre Channel packets to figure out exactly where a new device driver was going wrong.

In short – tons of fun for a geek like me.

Since then I’ve picked up CLARiiON, Centera, Celerra, and just about every EMC software product on the market, as well as most of what Symantec/Veritas (www.symantec.com) has to offer.

So now a friend refers to me as the “SAN God,” and the EMC consultants call me when they need help.  Kind of nice, but I’m hoping I can answer more questions here and stop chewing up cell-phone minutes. 😉

So I welcome any post, so long as you’re not spamming, and I’ll be happy to put my money where my mouth is and give advice and help where I can.  Just remember that I do still have a day job. 🙂

If I don’t know it, chances are I know someone who does.

So Welcome – and happy storing!

Jesse Gilleland