Disaster Recovery has come a long way since the days when companies did a backup and aimed for a "best effort" recovery. In fact, the term "disaster recovery" has evolved. No longer do we narrow "disasters" down to a hole in the ground or a fire in the data centre. "Disaster recovery" became "business recovery" since, after all, what you want is not merely to get your computer systems up and running again: you want a business up and running.
I've recently been working at a company where the Recovery Time Objective (RTO) - the time within which systems must be up and running again to avoid unacceptable business impact - is hopelessly long. The company's recovery plan was a warm backup, and in reality it would take three full 24-hour days to get the systems up again so that manufacturing and deliveries could restart. That's three days of hundreds of staff twiddling their thumbs, and the flow-on effect is simply disastrous for the business. So why so long before they can be up and running again? Quite simply, because the Disaster Recovery plan (as they still call it) was prepared several years ago. It's technically correct and workable, but the business simply wouldn't wear the outage.
Clearly it's time for this business, and many others, to revisit their Business Recovery. Or, better still, their Business Continuity Plans.
Recovery Time Objective
It's not enough to ask how long it takes to get the system fully operational for business. You also need to address the point to which you're going to restore, usually called the Recovery Point Objective (RPO). For example, if you're happy to lose all of today's data, you may be able to go back to last night's backup. It really depends on the business requirements. Whenever I hear someone say: "it only needs to be backed up once a week", or (worse) "we don't need a backup for it", I shudder. The real question is how much effort it takes to rebuild, especially without a backup. Losing a development system with 15 developers and users, just prior to going into production, may have far more business impact than losing a production system that is only accessed very occasionally by one or two users.
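To make those two questions concrete, here is a minimal sketch that checks a backup schedule against a recovery time target and a recovery point target. The class, the function names and all the figures are hypothetical, and the worst case simply assumes the failure hits just before the next backup would have run.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlan:
    """Hypothetical description of one system's recovery arrangements."""
    backup_interval_hours: float  # time between successive backups
    restore_hours: float          # estimated time to restore and restart


def worst_case_data_loss_hours(plan: RecoveryPlan) -> float:
    # If the failure hits just before the next backup, everything since
    # the previous backup is lost: one full backup interval.
    return plan.backup_interval_hours


def meets_objectives(plan: RecoveryPlan, rto_hours: float, rpo_hours: float) -> bool:
    # The plan is acceptable only if both targets are met.
    return (plan.restore_hours <= rto_hours
            and worst_case_data_loss_hours(plan) <= rpo_hours)


# Illustrative figures only: a nightly and a weekly backup, each with a
# three-day restore.
nightly = RecoveryPlan(backup_interval_hours=24, restore_hours=72)
weekly = RecoveryPlan(backup_interval_hours=168, restore_hours=72)

print(meets_objectives(nightly, rto_hours=8, rpo_hours=24))   # False: restore is too slow
print(meets_objectives(nightly, rto_hours=72, rpo_hours=24))  # True
print(meets_objectives(weekly, rto_hours=72, rpo_hours=24))   # False: too much data lost
```

The point of keeping the two checks separate is that a plan can pass one and fail the other: a weekly backup may restore quickly enough and still lose far more data than the business can tolerate.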
Block Level Replication vs. Byte Level
With so many components in the IT infrastructure, it's important to look at the components as a group. You may need, for example, a set of file systems from one LPAR, along with a set of file systems from different LPARs, all to be replicated in a consistent state. One missing link in the chain and the whole thing will fall apart. Plenty of businesses rely on database replication or replication at the storage level. However, if this turns out to be block-level replication, the entire block gets replicated even if only a tiny part of the data in it has changed. It's a little like having to replace an entire disk when you want to recover a small file. True, a block is not a whole disk, but you get the idea.
What block-level replication gives you is point-in-time snapshots, which are used to create recovery points. So the problem is not just that you replicate more data than you need (whole blocks). There is a second problem: you depend on those snapshots to put applications into a data-consistent state.
Another approach is byte-level replication. Of course, with any replication, you only need to replicate the files, blocks or bytes which have changed. That means additions, changes to data, or deletions (insert / update / delete). If the data protection goes down to the very last byte, the data is always maintained in a data-consistent state. It also means less data needs to be transmitted than with entire blocks or entire files.
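The difference in volume is easy to see with a little arithmetic. The sketch below is an illustration only, not how any particular replication product works: the function names and the 4096-byte block size are assumptions, and it ignores the headers and offsets a real product would also have to send.

```python
def block_level_bytes(changed_offsets, block_size):
    """Bytes sent if every block containing a change is sent whole."""
    dirty_blocks = {offset // block_size for offset in changed_offsets}
    return len(dirty_blocks) * block_size


def byte_level_bytes(changed_offsets):
    """Bytes sent if only the changed bytes themselves are sent."""
    return len(set(changed_offsets))


# Four changed bytes that happen to fall in three different blocks.
changes = [10, 11, 5000, 9000]
BLOCK_SIZE = 4096  # assumed block size, for illustration only

print(block_level_bytes(changes, BLOCK_SIZE))  # 12288 (three whole blocks)
print(byte_level_bytes(changes))               # 4
```

The more scattered the small changes are, the wider that gap becomes, because each one drags a whole block along with it.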
Running on Inertia
I have to say that IBM Power Systems are incredibly robust, and seem to run on inertia. I've seen some very ageing systems out there, especially in very small businesses (although some banks are guilty of shockingly outdated infrastructure), and the systems just seem to soldier on. I suppose it all comes down to risk and cost, so Business Continuity can be seen as an insurance policy. I can think of some Telephone Support Centres that are absolutely critical to a business's bread and butter. If those telephone systems are down, and haven't been backed up, the recovery may be extremely painful and expensive for the business. Then again, business criticality depends on a lot of factors. I can think of a business that runs a Hyperion budgeting system. That system really only becomes critical for a few weeks a year, and is hardly touched the rest of the time.
It's worth taking a fresh approach to disaster recovery, and thinking of it in terms of business continuity. The fact that the DR plan was tested and apparently works may not be sufficient to fulfil your business needs. How fast does the system have to be back up? How much data are you allowed to lose? It's really important to work through these questions, and find a solution (or combination of solutions) that will reduce your Recovery Time (RTO) and Recovery Point to a level that the business is willing to wear.