Many small companies are taking advantage of the simplicity of moving their infrastructure into a cloud computing environment. Many don't realize, however, that they still need a disaster recovery plan for that infrastructure. The cloud changes what disaster recovery requires; it's different from what you need for client-server and network infrastructures. Those differences, and the steps you need to take, are the topic of this article.
My own small organization, the Educopia Institute (an organization that plans and implements shared cyberinfrastructure projects for scholarly communication), recently went through a complete disaster recovery exercise and found that the cloud computing environment made the process quite simple. In this article, discover the steps we took.
Note: The information in this article also applies if you have a physical production environment but run your development environment in the cloud.
By definition, running in the cloud means that you are using a virtual server. It is much easier and less expensive to make a ready-to-run copy of a virtual server than of a physical server: You don't need extra hardware for a disaster recovery instance of the server, although you will need to store the image at the disaster recovery cloud vendor. A virtual server also loads and runs from scratch much more quickly than you could ever rebuild a physical server. This ready-to-run copy of your virtual servers is at the heart of your disaster recovery plan.
To get started on your disaster recovery plan, you need to have a second physical location that can run your server image. This location should be at least a few hundred miles from the location of your primary servers. Because your personnel are already connecting to the server via the Internet, you won't need to move your team to the disaster site.
My organization uses Amazon Elastic Compute Cloud (Amazon EC2), and it is relatively straightforward to move from one Amazon data center to another. The magic command needed in the Amazon cloud is ec2-migrate-manifest. If you are moving from one vendor's data center to another vendor's data center, however, more is involved than if you are going between two data centers from the same vendor. There are many cloud computing environment vendors; you have to determine your best alternative for disaster recovery. In either case, whether you use Amazon EC2 or another vendor, you must first create a small proof of concept before committing to a full-blown disaster recovery plan.
In this proof of concept, you need to copy or migrate a complete image of a virtual server from the primary data center to the disaster recovery data center. Initially, this image need not be the latest complete image of a production server: You are just trying to prove that you can re-create a server from one cloud in another cloud's location. The proof of concept does not need to run live data or have a real web address.
No large commitment is needed here, either. It is inexpensive to set up some storage space and run a small server at all the cloud computing environment vendors. If you can, use your current cloud computing environment for disaster recovery; then you are that much closer. When you have a successful run of the proof of concept, you are ready to continue with your plan.
Remember that you probably won't use this vendor or site anytime soon: You just need to keep a close enough eye on it to know whether it is going out of business. This disaster recovery vendor will receive an image delivery from you on a regular basis; a failed delivery is how you will know that something is breaking down.
Obviously, you need to know and document what you actually have running and set up in your cloud, and that means gathering data.
At the end of this effort, carefully evaluate what must move to the disaster recovery site. This is not a simple case of "just move everything": You may have features or functions that are in place for testing purposes or are not critical and can be recovered later.
How do you go about finding out everything that you actually have in your cloud? You might have it documented; verify that the documentation is up to date. Log in to a shell on your running production server and be sure it is operating normally. Perform the following steps to create files that help you record the items on your server (the commands shown should work on any variant of Red Hat Linux®):
- Record the processes that are running:
ps -ef > /tmp/procs.txt
- Determine the active connections on your server:
netstat -an > /tmp/connects.txt
- Determine the file systems on your server:
df -ah > /tmp/mounts.txt
- Record the cron jobs that are scheduled:
cd /var/spool/cron
cat * > /tmp/crons.txt
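The steps above can be combined into a single snapshot script. The sketch below is my own packaging of those commands, not part of the article's procedure: the /tmp/dr-snapshot directory is an illustrative choice, and ss is used as a fallback where netstat is not installed.

```shell
#!/bin/sh
# Capture a point-in-time record of what the server is running.
OUT=/tmp/dr-snapshot
mkdir -p "$OUT"

ps -ef > "$OUT/procs.txt"                      # running processes
netstat -an > "$OUT/connects.txt" 2>/dev/null \
  || ss -an > "$OUT/connects.txt"              # active connections
df -ah > "$OUT/mounts.txt"                     # file systems
cat /var/spool/cron/* > "$OUT/crons.txt" 2>/dev/null \
  || : > "$OUT/crons.txt"                      # crontabs (may be empty)

date > "$OUT/taken-at.txt"                     # when the snapshot was taken
```

Keep a copy of this snapshot with your disaster recovery documentation; you will compare against it after the exercise.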
Because you are moving a virtual server, not rebuilding a server from scratch, there is no need to identify every software package and every module (like Apache or Perl modules) or Ruby GEM on your system. All these elements will be there because you are copying virtual images.
This list of connections will help you determine the security and firewall settings needed at the disaster recovery site. Also important: The other servers you allow access to, and the other servers that allow access from you, should come out of this list.
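As a sketch of how to pull that information out of the netstat capture, the fragment below lists listening ports and established peers. The sample data is invented for illustration; in practice you would feed in the /tmp/connects.txt file captured on the production server.

```shell
#!/bin/sh
# Illustrative sample of `netstat -an` output.
cat > /tmp/connects.txt <<'EOF'
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN
tcp        0      0 10.0.0.5:22             203.0.113.9:52100       ESTABLISHED
EOF

# Ports the server listens on: these need inbound rules at the DR site.
awk '$6 == "LISTEN" { split($4, a, ":"); print a[2] }' /tmp/connects.txt \
  | sort -un > /tmp/listen-ports.txt

# Remote hosts with established connections: review these for rules that
# must also exist, in both directions, at the DR site.
awk '$6 == "ESTABLISHED" { split($5, a, ":"); print a[1] }' /tmp/connects.txt \
  | sort -u > /tmp/peers.txt
```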
The list of processes should match up line for line with the processes running at your disaster recovery site. (A few processes related to the hardware you are running on might differ.) You will definitely get to see how well you have configured all of your system startup scripts.
Any issues with processes not starting properly may need to be addressed in the startup scripts of your primary environment. In particular, you must evaluate the cron jobs:
- Is the time of day a job runs really meaningful?
- Do you need to change something because the server will run in a different time zone?
- Do any of the scripts called use a facility that is at the primary cloud computing environment? If so, this facility will need to be available at the disaster recovery environment.
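For the time-zone question above, a quick way to see the shift is to convert a job's primary-site run time into the DR site's local time. The two zones below are illustrative assumptions, not from the article.

```shell
#!/bin/sh
# If a job fires at 02:00 at the primary site, what wall-clock hour is
# that at the DR site? Zones here are assumptions for illustration.
PRIMARY_TZ=America/New_York
DR_TZ=America/Los_Angeles

EPOCH=$(TZ=$PRIMARY_TZ date -d '02:00' +%s)    # today's 02:00, primary zone
DR_HOUR=$(TZ=$DR_TZ date -d "@$EPOCH" +%H)     # same instant, DR zone
echo "A 02:00 primary-site job runs at ${DR_HOUR}:00 DR local time"
```

Eastern and Pacific time change for daylight saving on the same dates, so this offset is constant; between zones that switch on different dates, check both summer and winter.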
Look at the file systems primarily for size issues: You don't want to suddenly end up with a full file system at your disaster recovery site.
Now, look through these lists and decide on the items that have to be replicated at the disaster recovery site. If you can narrow down this list, you should. When your list is ready, you can move on to the next step.
Soon you will be ready to create your image and ship it to the disaster recovery site. This entire process varies depending on your cloud provider.
You must also consider how often this process will run and how you will keep the disaster recovery site updated. Consider carefully how much time and data you can afford to lose versus how much you pay to make sure nothing is lost. Obviously, you don't want any work or data to be lost, but this surety comes with a price.
In my organization's case, we decided that we could live without a week's worth of data, so we make a complete virtual image once a month. This image is also sent to our disaster recovery site. We perform full backups every week and incremental backups daily, and we send the weekly full backups to the disaster recovery site as well. These backups don't need to be redundant at the primary site, and we pay only a little more to send them over the Internet.
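As a sketch, that cadence looks like this as root crontab entries. The script names are hypothetical placeholders, not tools mentioned in this article, and the run times are illustrative.

```shell
# m h dom mon dow  command        (illustrative schedule, hypothetical scripts)
0 1 * * 1-6  /usr/local/bin/backup-incremental.sh   # daily incrementals
0 1 * * 0    /usr/local/bin/backup-full-ship.sh     # weekly full, sent to the DR site
0 2 1 * *    /usr/local/bin/make-image-ship.sh      # monthly image, sent to the DR site
```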
At this point, you have started a checklist and know where your alternate cloud servers will be running. You now need to run a beta test.
The full image of your production server can be copied over or migrated to the alternate cloud. You can run the alternate server at your convenience to make sure this part of the process works as expected. After ensuring that the process goes smoothly, there are still more steps to be ready for a full disaster recovery exercise.
The biggest change is the network identity of the disaster recovery server. Simply put, you have to use a different IP address for this server. You can keep all of your domain names, but their IP addresses have to change. This change leads to several issues, the most significant of which is changing the IP address of your domain name. (This is called the DNS A record.) You change the A record when a disaster recovery exercise is run and in an actual disaster.
Although the method used to update your A record varies, in general it consists of knowing the ID and password of the account at your DNS provider, as well as how to change records. Permanently reserve an IP address at your disaster recovery site and enter this IP address as a DNS entry. Giving it a name ensures that when the IP address is looked up, a valid record is returned. For instance, if your website is www.agreatsite.com, give the disaster recovery server a permanent DNS record of something like drwww.agreatsite.com. When the disaster recovery exercise is run (and in an actual disaster), you simply go into your DNS provider site and switch the IP address of www.agreatsite.com to the disaster recovery site's IP address: There is no reason to modify or delete the entry for drwww.agreatsite.com. Having a DNS record can help when other sites or servers must also enter your disaster recovery server into their security settings.
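During a cutover you will want to confirm that the name now resolves to the DR address. A small check function, sketched below under the assumption of a glibc-based Linux system (it uses getent), can be run repeatedly while DNS propagates. The host and address in the example are stand-ins chosen so the sketch can run anywhere.

```shell
#!/bin/sh
# Does this host name currently resolve to the address we expect?
check_cutover() {
    host="$1"; want_ip="$2"
    got_ip=$(getent ahostsv4 "$host" | awk '{ print $1; exit }')
    [ "$got_ip" = "$want_ip" ]
}

# Stand-in usage: localhost is used only so the example runs offline.
if check_cutover localhost 127.0.0.1; then
    echo "name resolves to the expected address"
fi
```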
Next, share the disaster recovery IP address with any other employees, divisions, vendors, and partners: any entity that currently has your primary IP address in its security settings. This is one item that you will have to think about and research carefully. The security settings are needed for when your server initiates a connection to another system.
You may or may not (probably not) have existing rules in your firewall for these connections. Typically, your own servers are allowed to initiate a connection without restriction. Similarly, you may or may not have seen an active connection when you ran the netstat command. Perhaps this connection runs only as needed and is not scheduled via cron. For instance, you may manually send an update of some sort via secure transfer only on an as-needed basis.
Finally, you need to know anything else that's different at the disaster recovery site. Be sure to consider the following items, and list the changes that will have to be made.
- Time zone.
- Storage for backups and archives.
- Facilities at the cloud computing environment that need to be mimicked.
- Changes to any scripts or code that refer to such facilities.
- Changes to any scripts or code that use IP addresses rather than host names.
Make the changes before the disaster-recovery exercise, if possible, and be prepared to make them during the exercise if they can only be made at that time.
You should now focus on actually running a full disaster recovery exercise: Just recording data and "thinking seriously" about this step won't cut it. Lay the steps out in order, then schedule your exercise. Your team members will have to agree on a date and time when little or no damage can come from having your site down for a short period. Warn the interested parties and ensure that there will be no conflict at the scheduled exercise time.
Let me emphasize: This is the only time you will ever know for sure that disaster is about to strike!
The first step of the exercise is to change the DNS settings because the changes will take time to propagate. Then you can bring down your primary servers. But before starting the disaster recovery site, consider whether anything can be done quickly to mitigate the damage of losing the primary site.
Perhaps you have monitoring or alerting configured, such as with Nagios. If so, you can turn off the alarms. Also, other systems may depend on your primary server. What can you do about that? Anything that can be done quickly or can be handed off to someone else while you bring up the disaster recovery site should be done.
Now you can start your server at the disaster recovery site. Initiate things per the checklist you made earlier. Depending on how you chose to keep your image at the disaster recovery site, you may also need to restore a backup.
Finishing touches may be needed on your server after it boots. For example, you may have to modify the scripts that run on it to use the disaster recovery storage facilities. You will certainly have to drop or change the task that regularly creates your disaster recovery images. You may also have to vary the times at which your cron jobs run.
After some delay to allow the DNS changes to propagate (2 to 4 hours in our experience), you can start testing. Here you want to take the obvious path and then do a bit of reverse engineering. Check the most obvious things first, like whether the websites are up and running. You should already have an RSS feed that lists any sites you have running in your cloud; if you don't, create that feed now. It should include public-facing sites as well as the sites you use to administer the server, such as phpMyAdmin and the Drupal user login. Similarly, check your process monitoring. Is there something that was put in place temporarily that can now be undone? Maybe a process at another site had to be turned off and can now be turned back on.
Go back to the records you took at the beginning. Perform a close check to verify that all the processes and network connections are alive and well. From here, each organization will have a different set of tests to run to verify that the recovery was successful. If all has gone well, the only thing left to do is make sure the cron tasks are in place and see how they do over the next several days.
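A sketch of that comparison: sort the command names from the two process captures and let comm report what failed to come up. The process names below are invented sample data; in practice you would derive the lists from the procs.txt files recorded earlier.

```shell
#!/bin/sh
# Sample data standing in for command names extracted from `ps -ef`
# captures at the primary and DR sites.
printf '%s\n' sshd crond httpd mysqld | sort > /tmp/procs-primary.txt
printf '%s\n' sshd crond httpd        | sort > /tmp/procs-dr.txt

# Lines only in the primary list are processes missing at the DR site.
comm -23 /tmp/procs-primary.txt /tmp/procs-dr.txt > /tmp/missing.txt
cat /tmp/missing.txt
```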
Things should have gone well but probably not perfectly during this crucial testing. At this point, all the steps, including any things that might have been missed during the exercise, need to be wrapped up and documented. Then you get a chance to repeat the exercise and see things go perfectly.
Schedule the "Revert to the Primary" exercise for the next weekend. This exercise is the great part about running a true, full disaster recovery exercise: You actually get to run it twice, learn from any mistakes, and have things prepared perfectly if a real disaster strikes.
During this reversion exercise, you run through the entire exercise again. Schedule another planned outage and move everything back to normal. This time, you can be confident that all the items are in place and documented.
Be sure that a full review of the disaster recovery exercise is made. The exercise needs to be run at regular intervals: probably at least once every two years, but no more often than every six months. Always evaluate whether system changes will require changes to the disaster recovery plans.
The disaster recovery effort for every organization and site will be different. In this article, I've provided a good starting point as well as things to think about. Certainly some other items could be researched. For example, you may have more work or cost if you have SSL certificates tied to IP addresses. Maybe you can avoid maintaining separate scripts for the primary versus the disaster recovery site and simply add code so that the scripts detect where they are running. I plan to do this next time around and found the site www.whatismyip.com helpful. You can use the command
wget http://www.whatismyip.com/automation/n09230945.asp -O public_ip.txt
to have just your public IP address returned, and then use that IP address in a case statement in the scripts that need to change from one site to another.
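A sketch of that idea: wrap the lookup so the same logic can be exercised without network access, then branch on the address. The wget URL is the one from the article; the address ranges are illustrative placeholders, not real data-center ranges.

```shell
#!/bin/sh
# Which site is this script running at? Decide from the public IP address.
get_public_ip() {
    # Production lookup (from the article); a placeholder address is
    # returned here so the sketch runs without network access:
    #   wget -q http://www.whatismyip.com/automation/n09230945.asp -O -
    echo "203.0.113.10"
}

site_for_ip() {
    case "$1" in
        198.51.100.*) echo primary ;;   # assumed primary data-center range
        203.0.113.*)  echo dr      ;;   # assumed DR data-center range
        *)            echo unknown ;;
    esac
}

SITE=$(site_for_ip "$(get_public_ip)")
echo "Running at the $SITE site"        # other scripts can branch on $SITE
```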
The disaster recovery exercise also gives your operation another benefit: You now have a ready-to-run, complete, and up-to-date testing environment. On occasion (like a major upgrade to a software package), you may want to work out all the steps needed for a change to your environment before attempting the change in production. You can crank up the disaster recovery environment and check out what it takes to complete the steps — maybe even script the steps before making the change to your primary environment.
If you haven't started a disaster recovery plan, then now is the time to start. The cloud and virtual computing make it a lot simpler than the "old days." Good luck in your planning!
Downloadable resource: ec2-ami.zip (1 KB, HTTP), an example of using the ec2-migrate-manifest command.
- Learn more about Amazon's cloud offering, Amazon EC2.
- Learn about migrating to the Amazon cloud in the developerWorks article Migrate your Linux application to the Amazon cloud, Part 1: Initial migration (Sean A. Walberg, July 2010).
- Explore the developerWorks Cloud Computing zone, where you can find valuable community discussions and learn about new technical resources related to the cloud.
- In IBM Smart Business Cloud Computing, get valuable business advice to enhance performance and efficiency in the cloud.
- Read Cloud Computing—A Primer for a basic understanding of cloud computing.
- Follow developerWorks on Twitter.
- Watch on-demand demos ranging from product installation and setup demos for beginners to advanced functionality for experienced developers.

Get products and technologies

- Evaluate IBM products in the way that suits you best: Download a product trial, try a product online, use a product in a cloud environment, or spend a few hours in the Sandbox learning how to implement SOA efficiently.
- Get involved in the My developerWorks community. Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.
Bill Robbins is the systems administrator for the Educopia Institute, a nonprofit organization that runs a Linux®-based infrastructure on the Amazon EC2 cloud. He holds a Master of Science degree in electrical engineering and a Bachelor's degree in the same field from the Georgia Institute of Technology. Prior to joining Educopia in 2008, he worked in IT and network management at BellSouth and Emory University and as a design engineer for telecommunications companies in Florida and Georgia. He has worked with many varieties of UNIX since before there were graphical terminals.