A Rant about RAID, with a Bad Metaphor about Eggs, and No Happy Ending.
Posted by Dylan Beattie on 03 November 2008
I went in to work this morning and my main workstation had died over the weekend. Bluescreen on boot, no safe mode, nothing. Windows Update gone bad? We'll probably never know, given I don't think it's coming back any time soon... but, as with previous overnight machine suicides, it looks like a problem with SATA RAID - specifically, two WD Velociraptors in a RAID-1 (mirror) array controlled by an Intel ICH10R chipset on an Asus P5Q motherboard.
You know your whole eggs & baskets thing, right? SATA RAID is like carefully dividing your eggs into two really good baskets, then tying them together with six feet of wet spaghetti and hanging them off a ceiling fan.
Long story short, I lost a day, and counting. I had to split the mirror into individual drives, switch the BIOS back to IDE, which gave me a bootable OS but - seriously - no text. No captions, no icon labels, no button text, nothing. Just these weird, ghostly empty buttons. Running a repair off the WinXP x64 CD got my labels back, but somehow left Windows on drive D. Another half-hour of registry hacks to get it back to drive C: where it belongs, and I had a creaking but functional system - VS2008 and Outlook are working, but most of my beloved little apps are complaining that someone's moved their cheese. Reinstalling is probably inevitable, along with the deep, deep joy that is reinstalling Adobe Creative Suite when your last remaining "activation" is bound to a PC that now refuses to deactivate it. Even Adobe's support team don't understand activation. Best they could come up with was "yes, that means there's no activations on that system." Err, no, Mr. Adobe, there are. It was very clear on that point. Wouldn't let me run Photoshop without it, you see. "Oh... then you'd better just reformat, and when you reinstall, you'll need to phone us for an activation override". Thanks, guys. I feel the love.
Sorry, I digress. This whole experience is all the more frustrating because RAID mirrors are supposed to be a Good Thing. If you believe the theory, RAID-1 will let you keep on working in the event of a single drive failure. Well... In the last 5 years or so, I haven't had a single workstation die because of a failed hard drive, but I've lost count of the number of times an Intel SATA RAID controller has suddenly thrown a hissy-fit under Windows XP and taken the system down with it. Every time it starts with a bit of instability, ends up a week or two later with bluescreens on boot and general wailing and gnashing of teeth, and every time, running drive diagnostics on the physical disks shows them to be absolutely fine.
This is across four different Intel-chipset motherboards - two Abit, one Asus, and one in a Dell Precision workstation - running both the ICH9R (P35) and ICH10R (P45) chipsets, and various matched pairs of WD Caviar, WD Raptor, WD Velociraptor and Seagate drives. One system was a normal Dell Precision workstation; the others are various home-built combinations, all thoroughly memtest86'ed and burned-in before being put into production doing anything important.
Am I doing something wrong here? I feel like I've invested enough of both my and my employer's time and money in "disaster-proofing" my working environment, and just ended up shooting myself in the foot. I'm beginning to think that having two identical workstations, with a completely non-RAID-related disk-mirroring strategy, is the only way to actually guarantee any sort of continuity - if something goes wrong, you just stick the spare disk in the spare PC and keep on coding. Or hey, just keep stuff backed up and whenever you lose a day or two to HD failure, tell yourself it's nothing compared to the 5-10 days you'd have lost if you'd done something sensible like using desktop RAID in the first place.
[Photo from bartmaguire via Flickr, used under Creative Commons license. Thanks Bart.]