Disaster recovery in a cloud environment

Plan for a complete loss of service from your cloud provider

Even the smallest organizations need to prepare for business continuity in case of a disaster — a complete loss of service from a cloud provider is highly unlikely, but it is irresponsible not to plan for it. Disaster recovery can be complicated, but it becomes much simpler in a cloud computing environment. Discover the steps the author took during a recent disaster recovery exercise at his organization and learn how you can use the process provided as a template for your own disaster recovery efforts.


Bill Robbins, System Engineer, The Educopia Institute

Bill Robbins is the systems administrator for the Educopia Institute, a nonprofit organization that runs a Linux®-based infrastructure on the Amazon EC2 cloud. He holds a Master of Science degree in electrical engineering and a bachelor's degree in the same field from the Georgia Institute of Technology. Prior to joining Educopia in 2008, he worked in IT and network management at BellSouth and Emory University and as a design engineer for telecommunications companies in Florida and Georgia. He has worked with many varieties of UNIX since before there were graphical terminals.

07 February 2011


Many small companies are taking advantage of the simplicity provided by moving their infrastructure into a cloud computing environment. Many don't realize, however, that they should still prepare a disaster recovery plan for that infrastructure. The cloud changes what's needed for disaster recovery; the requirements differ from those of client-server and network infrastructures. Those differences and the steps you need to take are the topic of this article.

My own small organization, the Educopia Institute, which plans and implements shared cyberinfrastructure projects for scholarly communication, recently went through a complete disaster recovery exercise and found that the cloud computing environment made the process quite simple. In this article, discover the steps we took.

Frequently used acronyms

  • DNS: Domain Name System
  • SSL: Secure Sockets Layer

Note: The information in this article also applies if you have a physical production environment but run your development environment in the cloud.

Virtual by definition

By definition, running in the cloud means that you are using a virtual server. It is much easier and less expensive to make a ready-to-run copy of a virtual server than it is a physical server: You don't need extra hardware for a disaster recovery instance of the server, although you will need to have the image stored at the disaster recovery cloud vendor. A virtual server loads and runs from scratch much more quickly than you could ever rebuild a physical server from scratch. This ready-to-run copy of your virtual servers is at the heart of your disaster recovery plan.

To get started on your disaster recovery plan, you need to have a second physical location that can run your server image. This location should be at least a few hundred miles from the location of your primary servers. Because your personnel are already connecting to the server via the Internet, you won't need to move your team to the disaster site.

Migrating within Amazon EC2

The Download section provides an example of the ec2-migrate-manifest command that my organization used to migrate our infrastructure within Amazon EC2.

My organization uses Amazon Elastic Compute Cloud (Amazon EC2), and it is relatively straightforward to move from one Amazon data center to another. The magic command needed in the Amazon cloud is ec2-migrate-manifest. Moving between data centers that belong to different vendors, however, involves more work than moving between two data centers of the same vendor. There are many cloud computing environment vendors; you have to determine your best alternative for disaster recovery. In either case, whether you use Amazon EC2 or another vendor, you must first create a small proof of concept before committing to a full-blown disaster recovery exercise.

Creating a proof of concept

In this proof of concept, you need to copy or migrate a complete image of a virtual server from the primary data center to the disaster recovery data center. Initially, this image need not be the latest complete image of a production server: You are just trying to prove that you can re-create a server from one cloud in another cloud's location. The proof of concept does not need to run live data or have a real web address.

No large commitment is needed here, either. It is inexpensive to set up some storage space and run a small server at any of the cloud computing environment vendors. If you can, use your current cloud computing environment for disaster recovery; then you are that much closer. When you have a successful run of the proof of concept, you are ready to continue with your plan.

Remember that you probably won't use this vendor or site anytime soon: You just need to keep a close enough eye to know if it is going out of business. This disaster recovery vendor will get an image delivery from you on a regular basis; this is how you know if something is breaking down.

What must run at your disaster recovery site

Obviously, you need to know and document what you actually have running and set up in your cloud. That means gathering data.

At the end of this effort, carefully evaluate what must move to the disaster recovery site. This is not a simple case of "just move everything": You may have features or functions that are in place for testing purposes or are not critical and can be recovered later.

How do you go about finding out everything that you actually have in your cloud? You might have it documented; verify that the documentation is up to date. Log in to a shell on your running production server and be sure it is operating normally. Perform the following steps to create files that help you record the items on your server (the commands shown should work on any variant of Red Hat Linux®):

  1. Record the processes that are running:
    ps -ef > /tmp/procs.txt
  2. Determine the active connections on your server:
    netstat -an > /tmp/connects.txt
  3. Determine the file systems on your server:
    df -ah > /tmp/mounts.txt
  4. Record the running cron jobs:
    cd /var/spool/cron
    more * > /tmp/crons.txt

Because you are moving a virtual server, not rebuilding a server from scratch, there is no need to identify every software package, every module (like Apache or Perl modules), or every Ruby gem on your system. All these elements will be there because you are copying virtual images.

The list of connections will help you determine the security and firewall settings needed at the disaster recovery site. Also important: Use this list to identify every other server that you allow to connect to you and every other server that allows connections from you.
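One way to turn the saved netstat output into a short review list is sketched below. The function names are hypothetical, and the column positions assume the usual Linux `netstat -an` layout (protocol in column 1, local address in column 4, foreign address in column 5, state in column 6); check them against your own /tmp/connects.txt before relying on the result.

```shell
# Sketch only: both functions read `netstat -an` output on standard input.
# The function names are illustrative and not part of any tool.

# Print each remote address that has an established TCP connection.
list_remote_peers() {
    awk '$1 ~ /^tcp/ && $6 == "ESTABLISHED" {
             addr = $5
             sub(/:[0-9]+$/, "", addr)    # strip the trailing :port
             print addr
         }' | sort -u
}

# Print each TCP port on which the server is listening.
list_listening_ports() {
    awk '$1 ~ /^tcp/ && $6 == "LISTEN" {
             port = $4
             sub(/^.*:/, "", port)        # keep only the port number
             print port
         }' | sort -n -u
}

# Example use:
#   list_remote_peers    < /tmp/connects.txt
#   list_listening_ports < /tmp/connects.txt
```

The listening ports suggest which inbound firewall rules the disaster recovery site needs; the remote peers are candidates for the parties that must learn your new IP address. Connections that were not active when netstat ran will not appear.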

The list of processes should match up line-for-line with the processes running on the server at your disaster recovery site. (A process tied to the hardware you are running on might differ.) You will definitely get to see how well you have configured all of your system startup scripts.

Any issues with processes not starting properly may need to be addressed in the startup scripts of your primary environment. In particular, you must evaluate the cron jobs individually:

  • Is the time of day a job runs really meaningful?
  • Do you need to change something because the server will run in a different time zone?
  • Do any of the scripts called use a facility that is at the primary cloud computing environment? If so, this facility will need to be available at the disaster recovery environment.
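To review the cron jobs one by one, it can help to reduce the saved crontab listing to the hour, the minute, and the command of each job. The following is a minimal sketch; the function name is hypothetical, and it assumes standard five-field crontab entries (entries such as @reboot, and any line with fewer than six fields, are skipped).

```shell
# Sketch only: reads crontab text on standard input and prints one line per job.
cron_schedule_summary() {
    awk '
        /^[ \t]*#/ { next }    # skip comments
        NF < 6     { next }    # skip blank lines, short variable lines, file banners
        {
            cmd = $6
            for (i = 7; i <= NF; i++) cmd = cmd " " $i
            printf "hour=%s minute=%s  %s\n", $2, $1, cmd
        }'
}

# Example use:
#   cron_schedule_summary < /tmp/crons.txt
```

Each output line gives you a job to check against the three questions above.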

Look at the file systems primarily for size issues: You don't want to suddenly end up with a full file system at your disaster recovery site.
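A quick way to spot file systems that are already close to full is to filter the saved df output by the Use% column. This is a sketch under assumptions: the function name and the 80 percent threshold are illustrative, and the code simply looks for the first field of the form NN% on each line and treats the next field as the mount point.

```shell
# Sketch only: reads `df -ah` output on standard input.
# $1 = usage threshold in percent; prints "mount-point usage" for each match.
filesystems_over() {
    awk -v limit="$1" '
        {
            for (i = 1; i < NF; i++) {
                if ($i ~ /^[0-9]+%$/) {
                    pct = $i
                    sub(/%$/, "", pct)
                    if (pct + 0 >= limit + 0) print $(i + 1), $i
                    break
                }
            }
        }'
}

# Example use:
#   filesystems_over 80 < /tmp/mounts.txt
```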

Now, look through these lists and decide on the items that have to be replicated at the disaster recovery site. If you can narrow down this list, you should. When your list is ready, you can move on to the next step.

Keeping the disaster recovery site up to date

Soon you will be ready to create your image and ship it to the disaster recovery site. This entire process varies depending on your cloud provider.

You must also consider how often this process will run and how you will keep the disaster recovery site updated. Consider carefully how much time and data you can afford to lose versus how much you pay to make sure nothing is lost. Obviously, you don't want any work or data to be lost, but this surety comes with a price.

In my organization's case, we decided that we could live without a week's worth of data. We make a complete virtual image once a month and send it to our disaster recovery site. We perform full backups every week and incremental backups daily, and we send the weekly full backups to the disaster recovery site as well. These backups don't need to be stored redundantly at the primary site, and we pay only a little more to send them over the Internet.

Steps for a full exercise

At this point, you have started a checklist and know where your alternate cloud servers will be running. You now need to run a beta test.

The full image of your production server can be copied over or migrated to the alternate cloud. You can run the alternate server at your convenience to make sure this part of the process works as expected. After ensuring that the process goes smoothly, there are still more steps to be ready for a full disaster recovery exercise.

A new network identity

The biggest change is the network identity of the disaster recovery server. Simply put, you have to use a different IP address for this server. You can keep all of your domain names, but their IP addresses have to change. This change leads to several issues, the most significant of which is changing the IP address of your domain name. (This is called the DNS A record.) You change the A record when a disaster recovery exercise is run and in an actual disaster.

Although the method used to update your A record varies, in general it consists of knowing the ID and password of the account at your DNS provider, as well as how to change records. Permanently reserve an IP address at your disaster recovery site and enter this IP address as a DNS entry. Giving it a name ensures that a valid record is returned when the IP address is looked up.

For instance, if your website is www.agreatsite.com, give the disaster recovery server a permanent DNS record such as alt-www.agreatsite.com or drwww.agreatsite.com. When the disaster recovery exercise is run (and in an actual disaster), you simply go into your DNS provider site and switch the IP address of www.agreatsite.com to the disaster recovery site's IP address: There is no reason to modify or delete the entry for alt-www.agreatsite.com. Having a DNS record can help when other sites or servers must also enter your disaster recovery server into their security settings.
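After you switch the A record, you can check from the command line whether the name already resolves to the disaster recovery address. The sketch below assumes the dig utility is installed; the function name is hypothetical, and 203.0.113.20 is a placeholder for the address you reserved.

```shell
# Sketch only: succeeds if the expected IP address appears among the
# resolver answers supplied on standard input, one per line.
# $1 = the IP address reserved at the disaster recovery site (placeholder below).
answer_contains() {
    grep -q -x -F -- "$1"
}

# Example use (dig must be installed):
#   if dig +short www.agreatsite.com A | answer_contains 203.0.113.20; then
#       echo "www now points at the disaster recovery site"
#   else
#       echo "www still points elsewhere, or the change has not propagated"
#   fi
```

Resolvers elsewhere may keep returning the old address until their cached copy expires, so a positive answer from your own machine does not mean every visitor sees the change yet.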

Security settings for the new identity

Next, share the disaster recovery IP address with any other employees, divisions, vendors, and partners, that is, any entity that currently has your primary IP address in its security settings. This is one item that you will have to think about and research carefully. The security settings are needed for when your server initiates a connection to another system.

You may or may not (probably not) have existing rules in your firewall for these connections. Typically, your own servers are allowed to initiate a connection without restriction. Similarly, you may or may not have seen an active connection when you ran the netstat command. Perhaps this connection runs only as needed and is not scheduled via cron. For instance, you may manually send an update of some sort via secure transfer only on an as-needed basis.

Changes at the disaster recovery site

Finally, you need to know anything else that's different at the disaster recovery site. Be sure to consider the following items, and list the changes that will have to be made.

  • Time zone.
  • Storage for backups and archives.
  • Facilities at the cloud computing environment that need to be mimicked.
  • Changes to any scripts or code that refer to such facilities.
  • Changes to any scripts or code that use IP addresses rather than host names.
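For the last item, a recursive search can list the places where scripts contain something that looks like an IPv4 address. This is only a sketch: the function name and the directory in the example are hypothetical, and the pattern is deliberately loose, so it also reports dotted version numbers such as 1.2.3.4 that you will need to rule out by eye.

```shell
# Sketch only: print file name, line number, and text of every line under
# the given directory that contains a dotted-quad pattern.
# $1 = directory to scan
find_hardcoded_ips() {
    grep -rnE '([0-9]{1,3}\.){3}[0-9]{1,3}' -- "$1"
}

# Example use (the path is a placeholder for wherever your scripts live):
#   find_hardcoded_ips /usr/local/bin
```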

Make the changes before the disaster-recovery exercise, if possible, and be prepared to make them during the exercise if they can only be made at that time.

Running the disaster recovery exercise

You should now focus on actually running a full disaster recovery exercise: Just recording data and "thinking seriously" about this step won't cut it. Lay the steps out in order, then schedule your exercise. Your team members will have to agree on a date and time when little or no damage can come from having your site down for a short period. Warn the interested parties and ensure that there will be no conflict at the scheduled exercise time.

Let me emphasize: This is the only time you will ever know for sure that disaster is about to strike!

Let the exercise begin

The first step of the exercise is to change the DNS settings because the changes will take time to propagate. Then you can bring down your primary servers. But before starting the disaster recovery site, consider whether anything can be done quickly to mitigate the damage of losing the primary site.

Perhaps you have monitoring or alerting configured, such as with Nagios. If so, you can turn off the alarms. Also, other systems may depend on your primary server. What can you do about that? Anything that can be done quickly or can be handed off to someone else while you bring up the disaster recovery site should be done.

Head to the disaster recovery cloud

Now you can start your server at the disaster recovery site. Initiate things per the checklist you made earlier. Depending on how you chose to keep your image at the disaster recovery site, you may also need to restore a backup.

Finishing touches may be needed on your server after it boots. For example, you may have to modify the scripts that run on it to use the disaster recovery storage facilities. You will certainly have to drop or change the task that regularly creates your disaster recovery images. You may also have to vary the times at which your cron jobs run.

Run a functional test

After some delay to allow the DNS changes to propagate (2 to 4 hours is what we experienced), you can start testing things. Here you want to take the obvious path and do a bit of reverse engineering. Check the most obvious things first, like whether the websites are up and running. You should already have an RSS feed that lists any sites you have running in your cloud. If you don't, create that feed now. It should include sites that are public facing as well as those sites you use to administer the server, such as phpMyAdmin and the Drupal users login. Similarly, check your process monitoring. Is there something that was put in place temporarily that can now be undone? Maybe a process at another site had to be turned off and can now be turned back on.

Go back to the records you took at the beginning. Perform a close check to verify that all the processes and network connections are alive and well. From here, each organization will have a different set of tests to run to verify that the recovery was successful. If all has gone well, the only thing left to do is make sure any cron tasks are in place and see how they do over the next several days.
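The process check can be partly automated by comparing the `ps -ef` listing saved on the primary with a fresh one taken on the disaster recovery server. The sketch below uses hypothetical function and file names, and it assumes the usual Linux `ps -ef` layout in which the command starts in column 8.

```shell
# Sketch only: compare two saved `ps -ef` listings.

# Print the unique command lines (column 8 onward) from a listing file.
process_names() {
    awk 'NR > 1 {
             cmd = $8
             for (i = 9; i <= NF; i++) cmd = cmd " " $i
             print cmd
         }' "$1" | sort -u
}

# Print commands present in the first listing but absent from the second.
# $1 = listing from the primary, $2 = listing from the disaster recovery server
missing_processes() {
    tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 1
    process_names "$1" > "$tmp_a"
    process_names "$2" > "$tmp_b"
    comm -23 "$tmp_a" "$tmp_b"
    rm -f "$tmp_a" "$tmp_b"
}

# Example use, after copying the primary listing to the new server
# (both file names are placeholders):
#   ps -ef > /tmp/procs-dr.txt
#   missing_processes /tmp/procs.txt /tmp/procs-dr.txt
```

Expect some noise: short-lived commands, kernel threads, and the ps command itself can show up as differences that do not matter.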

Document the details

Things should have gone well but probably not perfectly during this crucial testing. At this point, all the steps, including any things that might have been missed during the exercise, need to be wrapped up and documented. Then you get a chance to repeat the exercise and see things go perfectly.

Revert to the primary and do it all over again

Schedule the "Revert to the Primary" exercise for the next weekend. This exercise is the great part about running a true, full disaster recovery exercise: You actually get to run it twice, learn from any mistakes, and have things prepared perfectly if a real disaster strikes.

During this reversion exercise, you run through the entire exercise again. Schedule another planned outage and move everything back to normal. This time, you can be confident that all the items are in place and documented.

Be sure to conduct a full review of the disaster recovery exercise. Run the exercise at regular intervals: at least once every two years, though probably no more often than every six months. Evaluate every system change to determine whether it requires changes to the disaster recovery plan.

Final wrap

The disaster recovery effort for every organization and site will be different. In this article, I've provided a good starting point as well as things to think about. Certainly some other items could be researched. For example, you may have more work or cost if you have SSL certificates tied to IP addresses. Maybe you can avoid editing scripts to run on the primary versus the disaster recovery site and simply add code to the scripts so that they detect where they are running. I plan to do this next time around and found the site www.whatismyip.com helpful. You can use the command:

wget http://www.whatismyip.com/automation/n09230945.asp -O public_ip.txt

to have just your public IP address returned and then use that IP address in a case statement in your scripts that need to change from one site to another.
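A minimal sketch of such a case statement follows. The function name is hypothetical and both IP addresses are placeholders; substitute the public addresses of your own primary and disaster recovery servers.

```shell
# Sketch only: map a public IP address to a site label.
# The two addresses below are placeholders, not real server addresses.
site_for_ip() {
    case "$1" in
        203.0.113.10)  echo primary ;;
        198.51.100.20) echo dr ;;
        *)             echo unknown ;;
    esac
}

# Example use, with public_ip.txt written by the wget command above:
#   ip=$(tr -d '[:space:]' < public_ip.txt)
#   case "$(site_for_ip "$ip")" in
#       primary) backup_target=/mnt/primary-backups ;;   # placeholder path
#       dr)      backup_target=/mnt/dr-backups ;;        # placeholder path
#       *)       echo "unrecognized site: $ip" >&2; exit 1 ;;
#   esac
```

Failing loudly on an unrecognized address is a deliberate choice here: a script that cannot tell where it is running should not guess which storage or facilities to use.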

The disaster recovery exercise also gives your operation another benefit: You now have a ready-to-run, complete, and up-to-date testing environment. On occasion (like a major upgrade to a software package), you may want to work out all the steps needed for a change to your environment before attempting the change in production. You can crank up the disaster recovery environment and check out what it takes to complete the steps — maybe even script the steps before making the change to your primary environment.

If you haven't started a disaster recovery plan, then now is the time to start. The cloud and virtual computing make it a lot simpler than the "old days." Good luck in your planning!


Download

Example of using the ec2-migrate-manifest command: ec2-ami.zip (1KB)


