EMC Failure Brings Down State of Virginia
I'm down here in Australia, where the government has been somewhat stalled for the past two weeks, formally being managed by the [Caretaker government]. Apparently, there is a gap between the outgoing administration and the incoming administration, and the caretaker government does as little as possible until the new regime takes over. They are still counting votes, including in some cases dummy ballots known as "donkey votes", the Australian version of the hanging chad. Three independent members of parliament are also trying to decide which major party they will support to finalize the process.
While we are on the topic of a stalled government, I feel bad for the state of Virginia in the United States. Apparently, one of their supposedly high-end, enterprise-class EMC Symmetrix DMX storage systems, supporting 26 different state agencies in Virginia, crashed on August 25th, and now, more than a week later, many of those agencies are still down, including the Department of Motor Vehicles and the Department of Taxation and Revenue.
Many of the articles in the press on this event have focused on what this means for the reputation of EMC. Not surprisingly, EMC says that this failure is unprecedented, but really this is just one in a long series of failures from EMC. It reminds me of the last time EMC had a public failure, with a dual-controller CLARiiON a few months ago, that halted another company's operations. There is nothing unique in the physical equipment itself; all IT gear can break or be taken down by some outside force, such as a natural disaster. The real question, though, is why EMC and the State Government haven’t been able to restore operations many days after the hardware was fixed.
In the Boston Globe, Zeus Kerravala, a data storage analyst at Yankee Group in Boston, is quoted as saying that such a high-profile breakdown could undermine EMC’s credibility with large businesses and government agencies. “I think it’s extremely important for them,” said Kerravala. “When you see a failure of this magnitude, and their inability to get a customer like the state of Virginia up and running almost immediately, all companies ought to look at that and raise their eyebrows.”
Was the backup and disaster recovery solution capable of the scale and service level requirements needed by vital state agencies? Had they tested their backups to ensure they were running correctly, and had they tested their recovery plans? Were they monitoring the success of recent backup operations?
Eventually, the systems will be back up and running, fines and penalties will be paid, and perhaps the guy who chose to go with EMC might feel bad enough to give back that new set of golf clubs, or whatever ridiculously expensive gift EMC reps might offer to government officials these days to influence the purchase decision-making process.
(Note: I am not accusing any government employee in particular working at the state of Virginia of any wrongdoing, and mention this only as a possibility of what might have happened. I am sure the media will dig into that possibility soon enough during their investigations, so no sense in me discussing that process any further.)
So what lessons can we learn from this?