What Is Disaster Recovery (DR)?

What is DR?

Disaster recovery (DR) consists of IT technologies and best practices designed to prevent or minimize data loss and business disruption resulting from catastrophic events—everything from equipment failures and localized power outages to cyberattacks, civil emergencies, criminal or military attacks and natural disasters.

Many businesses—especially small- and mid-sized organizations—neglect to develop a reliable, practicable disaster recovery plan. Without such a plan, they have little protection from the impact of significantly disruptive events.

Infrastructure failure can cost as much as USD 100,000 per hour, and critical application failure costs can range from USD 500,000 to USD 1 million per hour. Many businesses cannot recover from such losses. More than 40% of small businesses will not re-open after experiencing a disaster, and among those that do, an additional 25% will fail within the first year after the crisis. Disaster recovery planning can dramatically reduce these risks.

Disaster recovery planning involves strategizing, planning, deploying appropriate technology, and continuous testing. Maintaining backups of your data is a critical component of disaster recovery planning, but a backup and recovery process alone does not constitute a full disaster recovery plan.

Disaster recovery also involves ensuring that adequate storage and compute resources are available to maintain robust failover and failback procedures. Failover is the process of offloading workloads to backup systems so that production processes and end-user experiences are disrupted as little as possible. Failback involves switching back to the original primary systems.
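
The failover and failback steps described above can be sketched as a small state machine. This is a minimal illustration, not a production design: the class name, the threshold of consecutive failed health checks, and the rule that failback requires an explicit verification of the primary are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class FailoverController:
    """Tracks which site serves traffic (all names here are hypothetical)."""

    failure_threshold: int = 3      # consecutive failed checks before failover
    active: str = "primary"
    consecutive_failures: int = 0

    def record_health_check(self, primary_healthy: bool) -> str:
        """Apply one health check of the primary and return the active site."""
        if primary_healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if (self.active == "primary"
                    and self.consecutive_failures >= self.failure_threshold):
                self.active = "backup"      # failover: offload to backup systems
        return self.active

    def fail_back(self, primary_verified: bool) -> str:
        """Failback: return to the primary only once it has been verified."""
        if self.active == "backup" and primary_verified:
            self.active = "primary"
            self.consecutive_failures = 0
        return self.active
```

In this sketch a healthy check does not trigger failback on its own; switching back is a deliberate step, which mirrors the idea that failback procedures should be planned and tested rather than automatic side effects.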

Read our article to learn more about the distinction between backup and disaster recovery planning.

Business impact analysis

The creation of a comprehensive disaster recovery plan begins with business impact analysis. When performing this analysis, you’ll create a series of detailed disaster scenarios that can then be used to predict the size and scope of the losses you’d incur if certain business processes were disrupted. What if your customer service call center was destroyed by fire, for instance? Or an earthquake struck your headquarters?

This analysis will allow you to identify the areas and functions of the business that are the most critical and to determine how much downtime each of these critical functions could tolerate. With this information in hand, you can begin to create a plan for how the most critical operations could be maintained in various scenarios.
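
As a rough illustration of how such an analysis might be recorded, the sketch below estimates the loss for an outage and ranks business functions by criticality. The field names, the ranking rule (shortest tolerable downtime first, then highest hourly loss), and the idea of a single hourly loss figure are simplifying assumptions, not a prescribed method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BusinessFunction:
    name: str
    hourly_loss_usd: float            # estimated loss per hour of disruption
    max_tolerable_downtime_h: float   # longest outage the business could absorb


def estimated_loss(function: BusinessFunction, outage_hours: float) -> float:
    """Estimated loss in USD if `function` is down for `outage_hours`."""
    return function.hourly_loss_usd * outage_hours


def rank_by_criticality(functions):
    """Most critical first: least tolerable downtime, then highest hourly loss."""
    return sorted(
        functions,
        key=lambda f: (f.max_tolerable_downtime_h, -f.hourly_loss_usd),
    )
```

The figures fed into a model like this come from the disaster scenarios themselves; the code only keeps them comparable across functions.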

IT disaster recovery planning should follow from and support business continuity planning. If, for instance, your business continuity plan calls for customer service representatives to work from home in the aftermath of a call center fire, what types of hardware, software, and IT resources would need to be available to support that plan?

Risk analysis

Assessing the likelihood and potential consequences of the risks your business faces is also an essential component of disaster recovery planning. As cyberattacks and ransomware become more prevalent, it’s critical to understand the general cybersecurity risks that all enterprises confront today as well as the risks that are specific to your industry and geographical location.

For a variety of scenarios, including natural disasters, equipment failure, insider threats, sabotage, and employee errors, you’ll want to evaluate your risks and consider the overall impact on your business. Ask yourself the following questions:

What financial losses due to missed sales opportunities or disruptions to revenue-generating activities would you incur?
What kinds of damage would your brand’s reputation undergo? How would customer satisfaction be impacted?
How would employee productivity be impacted? How many labor hours might be lost?
What risks might the incident pose to human health or safety?
Would progress towards any business initiatives or goals be impacted? How?
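
One simple way to compare the answers to these questions across scenarios is to weight each scenario's estimated impact by its likelihood. The sketch below assumes you can assign an annual probability and a single dollar impact to each risk; both inputs and the function names are hypothetical, and impacts such as human health and safety do not reduce to a dollar figure and need separate treatment.

```python
def expected_annual_loss(annual_probability: float, impact_usd: float) -> float:
    """Likelihood-weighted loss for one risk scenario."""
    if not 0.0 <= annual_probability <= 1.0:
        raise ValueError("annual_probability must be between 0 and 1")
    return annual_probability * impact_usd


def rank_risks(risks):
    """Sort risk names by expected annual loss, largest first.

    `risks` maps a scenario name to (annual_probability, impact_usd).
    """
    return sorted(
        risks,
        key=lambda name: expected_annual_loss(*risks[name]),
        reverse=True,
    )
```

A ranking like this is only a starting point for discussion: a rare but catastrophic scenario may still deserve priority over a frequent, minor one.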

Prioritizing applications

Not all workloads are equally critical to your business’s ability to maintain operations, and downtime is far more tolerable for some applications than it is for others. Separate your systems and applications into three tiers, depending on how long you could tolerate their being down and how serious the consequences of data loss would be.

Mission-critical: Applications whose functioning is essential to your business’s survival.
Important: Applications for which you could tolerate relatively short periods of downtime.
Non-essential: Applications you could temporarily replace with manual processes or do without.
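
The three tiers above could be assigned with a rule like the following sketch. The one-hour and 24-hour limits are placeholder assumptions chosen for illustration; each organization would set its own thresholds from its business impact analysis.

```python
from enum import Enum


class Tier(Enum):
    MISSION_CRITICAL = 1
    IMPORTANT = 2
    NON_ESSENTIAL = 3


def classify(tolerable_downtime_hours: float,
             has_manual_workaround: bool = False,
             critical_limit_h: float = 1.0,      # assumed threshold
             important_limit_h: float = 24.0):   # assumed threshold
    """Assign an application to a tier from its downtime tolerance."""
    if tolerable_downtime_hours <= critical_limit_h:
        return Tier.MISSION_CRITICAL
    if has_manual_workaround or tolerable_downtime_hours > important_limit_h:
        return Tier.NON_ESSENTIAL
    return Tier.IMPORTANT
```

For example, under these assumed limits an application that can be down for eight hours is classified as important, unless a manual process can stand in for it.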

Documenting dependencies

The next step in disaster recovery planning is creating a complete inventory of your hardware and software assets. It’s essential to understand critical application interdependencies at this stage. If one software application goes down, which others will be affected?

Designing resiliency—and disaster recovery models—into systems as they are initially built is the best way to manage application interdependencies. It’s all too common in today’s microservices-based architectures to discover processes that can’t be initiated when other systems or processes are down, and vice versa. This is a challenging situation to recover from, and it’s vital to uncover such problems when you have time to develop alternate plans for your systems and processes—before an actual disaster strikes.
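
Both questions raised in this section, which applications are affected when one goes down and whether any processes wait on each other in a circle, can be answered from a dependency inventory. The sketch below assumes the inventory is a plain mapping from each application to the applications it depends on; the function names and that representation are illustrative choices.

```python
from collections import deque


def affected_by(depends_on, failed):
    """Return every application that directly or indirectly depends on `failed`."""
    # Invert the mapping: for each application, who depends on it?
    dependents = {}
    for app, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(app)

    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for app in dependents.get(current, ()):
            if app != failed and app not in affected:
                affected.add(app)
                queue.append(app)
    return affected


def find_cycle(depends_on):
    """Return one circular dependency chain as a list, or None if there is none."""
    state = {}   # app -> "visiting" or "done"
    path = []

    def visit(app):
        state[app] = "visiting"
        path.append(app)
        for dep in depends_on.get(app, ()):
            if state.get(dep) == "visiting":
                return path[path.index(dep):] + [dep]
            if dep not in state:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[app] = "done"
        return None

    for app in depends_on:
        if app not in state:
            cycle = visit(app)
            if cycle:
                return cycle
    return None
```

A non-empty result from `find_cycle` points at exactly the kind of mutual dependency that is hard to recover from, so it is worth running against the inventory well before any recovery test.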

Establishing recovery time objectives, recovery point objectives, and recovery consistency objectives

By considering your risk and business impact analyses, you should be able to establish objectives for how quickly you’d need to bring systems back up, how much data you could stand to lose, and how much data corruption or deviation you could tolerate.

Your recovery time objective (RTO) is the maximum amount of time it should take to restore application or system functioning following a service disruption.

Your recovery point objective (RPO) is the maximum age of the data that must be recovered in order for your business to resume regular operations. For some businesses, losing even a few minutes’ worth of data can be catastrophic, while those in other industries may be able to tolerate longer windows.

A recovery consistency objective (RCO) is established in the service-level agreement (SLA) for continuous data protection services. It is a metric that indicates how many inconsistent entries in business data from recovered processes or systems are tolerable in disaster recovery situations, describing business data integrity across complex application environments.
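
The three objectives can be turned into simple checks, as in the sketch below. It assumes that worst-case data loss equals the interval between backups (a failure just before the next backup runs) and that restore time has been measured in a test. The consistency ratio uses one commonly cited formulation, 1 minus the share of inconsistent entries; the text above does not prescribe a formula, so treat it as an assumption.

```python
from datetime import datetime, timedelta


def data_loss_window(last_good_backup: datetime, failure_time: datetime) -> timedelta:
    """Age of the newest recoverable data at the moment of failure."""
    return failure_time - last_good_backup


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Worst case, a failure strikes just before the next backup runs."""
    return backup_interval <= rpo


def meets_rto(measured_restore_time: timedelta, rto: timedelta) -> bool:
    """Compare a restore time measured in testing against the objective."""
    return measured_restore_time <= rto


def consistency_ratio(inconsistent_entries: int, total_entries: int) -> float:
    """Share of recovered entries that are consistent (assumed RCO formulation)."""
    if total_entries <= 0:
        raise ValueError("total_entries must be positive")
    return 1 - inconsistent_entries / total_entries
```

For instance, nightly backups cannot satisfy a four-hour RPO under this model, no matter how fast the restore is.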

Regulatory compliance issues

All disaster recovery software and solutions that your enterprise has established must satisfy any data protection and security requirements that you’re mandated to adhere to. This means that all data backup and failover systems must be designed to meet the same standards for ensuring data confidentiality and integrity as your primary systems.

At the same time, several regulatory standards stipulate that all businesses must maintain disaster recovery and/or business continuity plans. The Sarbanes-Oxley Act (SOX), for instance, requires all publicly held firms in the U.S. to maintain copies of all business records for a minimum of five years. Failure to comply with this regulation (including neglecting to establish and test appropriate data backup systems) can result in significant financial penalties for companies and even jail time for their leaders.

Choosing technologies

Backups serve as the foundation upon which any solid disaster recovery plan is built. In the past, most enterprises relied on tape and spinning disks (HDD) for backups, maintaining multiple copies of their data and storing at least one at an offsite location.

In today’s always-on world, tape backups in offsite repositories often cannot achieve the RTOs necessary to maintain business-critical operations. Architecting your own disaster recovery solution involves replicating many of the capabilities of your production environment and will require you to incur costs for support staff, administration, facilities, and infrastructure. For this reason, many organizations are turning to cloud-based backup solutions or full-scale Disaster-Recovery-as-a-Service (DRaaS) providers.

Choosing recovery site locations

Building your own disaster recovery data center involves balancing several competing objectives. On the one hand, a copy of your data should be stored somewhere that’s geographically distant enough from your headquarters or office locations that it won’t be affected by the same seismic events, environmental threats, or other hazards as your main site. On the other hand, backups stored offsite always take longer to restore from than those located on-premises at the primary site, and network latency can be even greater across longer distances.
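
The trade-off between distance and restore speed can be roughed out numerically. The sketch below computes the great-circle distance between two sites with the haversine formula and estimates how long it would take to move a given volume of data over a link. The function names are hypothetical, and the transfer estimate ignores protocol overhead, latency effects, and restore processing time, so real restores will take longer.

```python
import math


def great_circle_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Haversine distance in kilometers between two points given in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def estimated_transfer_hours(data_gb: float, effective_mbps: float) -> float:
    """Hours needed to move `data_gb` gigabytes at `effective_mbps` megabits/s."""
    if effective_mbps <= 0:
        raise ValueError("effective_mbps must be positive")
    megabits = data_gb * 8 * 1000   # decimal units: 1 GB = 8,000 megabits
    return megabits / effective_mbps / 3600
```

Comparing the estimated transfer time with the RTO for each application tier shows quickly whether a distant site is workable for restores or only for long-term copies.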

Continuous testing and review

Simply put, if your disaster recovery plan has not been tested, it cannot be relied upon. All employees with relevant responsibilities should participate in the disaster recovery test exercise, which may include maintaining operations from the failover site for a period of time.

If performing comprehensive disaster recovery testing is outside your budget or capabilities, you can also schedule a “tabletop exercise” walkthrough of the test procedures, though you should be aware that this kind of testing is less likely to reveal anomalies or weaknesses in your DR procedures—especially the presence of previously undiscovered application interdependencies—than a full test.

As your hardware and software assets change over time, your disaster recovery plan must be updated to match. Review and revise the plan on a regular schedule.

The IBM Knowledge Center provides an example of a disaster recovery plan.

Disaster-Recovery-as-a-Service (DRaaS)

Disaster-Recovery-as-a-Service (DRaaS) is one of the most popular and fast-growing managed IT service offerings available today. Your vendor will document RTOs and RPOs in a service-level agreement (SLA) that outlines your downtime limits and application recovery expectations.

DRaaS vendors typically provide cloud-based failover environments. This model offers significant cost savings compared with maintaining redundant dedicated hardware resources in your own data center. Contracts are available in which you pay a fee for maintaining failover capabilities plus the per-use costs of the resources consumed in a disaster recovery situation. Your vendor will typically assume all responsibility for configuring and maintaining the failover environment.

Disaster recovery service offerings differ from vendor to vendor. Some vendors define their offering as a comprehensive, all-in-one solution, while others offer piecemeal services ranging from single application restoration to full data center replication in the cloud. Some offerings may include disaster recovery planning or testing services, while others will charge an additional consulting fee for these offerings.

Be sure that any enterprise software applications you rely on are supported, as are any public cloud providers that you’re working with. You’ll also want to ensure that application performance is satisfactory in the failover environment, and that the failover and failback procedures have been well tested.

Cloud DR

If you have already built an on-premises disaster recovery (DR) solution, it can be challenging to evaluate the costs and benefits of maintaining it versus moving to a monthly DRaaS subscription instead.

Most on-premises DR solutions will incur costs for hardware, power, labor for maintenance and administration, software, and network connectivity. In addition to the upfront capital expenditures involved in the initial setup of your DR environment, you’ll need to budget for regular software upgrades. Because your DR solution must remain compatible with your primary production environment, you’ll want to ensure that your DR solution has the same software versions. Depending upon the specifics of your licensing agreement, this might effectively double your software costs.

Moving to a DRaaS subscription can reduce your hardware and software expenditures, and it can also lower your labor costs by shifting the burden of maintaining the failover site to the vendor.
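
A first-pass comparison of the two cost models might look like the sketch below. It assumes on-premises capital expenditure is spread evenly over a fixed number of years and that DRaaS is billed as a flat monthly fee plus an hourly rate for resources used during failover. These are simplifications, and the parameter names are illustrative rather than any vendor's pricing terms.

```python
def on_prem_annual_cost(capex_usd: float,
                        amortization_years: int,
                        annual_opex_usd: float) -> float:
    """Yearly cost of an on-premises DR site: amortized capex plus opex.

    `annual_opex_usd` should cover power, labor, software licenses and
    upgrades, and network connectivity.
    """
    if amortization_years <= 0:
        raise ValueError("amortization_years must be positive")
    return capex_usd / amortization_years + annual_opex_usd


def draas_annual_cost(monthly_fee_usd: float,
                      usage_rate_usd_per_hour: float,
                      expected_failover_hours: float) -> float:
    """Yearly DRaaS cost: subscription fee plus per-use failover charges."""
    return 12 * monthly_fee_usd + usage_rate_usd_per_hour * expected_failover_hours
```

The expected number of failover hours is the hardest input to estimate; it should include planned test exercises as well as real incidents.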

If you’re considering third-party DRaaS solutions, you’ll want to make sure that the vendor has the capacity for cross-regional multi-site backups. If a significant weather event like a hurricane impacted your primary office location, would the failover site be far enough away to remain unaffected by the storm? Also, would the vendor have adequate capacity to meet the combined needs of all its customers in your area if many were impacted at the same time? You’re trusting your DRaaS vendor to meet RTOs and RPOs in times of crisis, so look for a service provider with a strong reputation for reliability.

Read “Disaster Recovery as a Service (DRaaS) vs. Disaster Recovery (DR): Which Do You Need?” for a comparative overview of both solutions.