Build a cloud failover policy

Create a failover policy with cloud-specific riders that detail components and tasks

Within a few years, most organizations and institutions will have moved a portion of their data to the cloud, will be in the process of transferring it, or will be deep into planning some amount of cloud usage. Their two chief concerns will still be reliability and security. On the reliability front, organizations still require a minimum of 99.5 percent uptime. Unfortunately, many organizations still respond reactively when a failure occurs instead of taking a more prudent proactive step: creating a cloud failover policy with cloud-specific riders, each detailing components and tasks. The author provides a roadmap for such a policy and illustrates policy riders and scenarios showing what proactive actions can be taken when failures happen.

Judith M. Myerson, Systems Engineer and Architect

Judith M. Myerson is a systems architect and engineer. Her areas of interest include enterprise-wide systems, middleware technologies, database technologies, cloud computing, threshold policies, industries, network management, security, RFID technologies, presentation management, and project management.



02 March 2012


The goal of a cloud failover policy is to ensure cloud service is available at least 99.5 percent of the time. It should focus on how a Software as a Service (SaaS) application, Platform as a Service (PaaS) platform, or Infrastructure as a Service (IaaS) virtual machines running in a cloud service can be transferred from a failing data center to a healthy one in order to provide nearly uninterrupted service to cloud customers.
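To make the 99.5 percent target concrete, the following minimal sketch converts an availability percentage into a downtime budget. The function name `downtime_budget_minutes` and the period lengths (a non-leap year and a 30-day month) are assumptions chosen for illustration, not terms from any particular SLA.

```python
def downtime_budget_minutes(availability_percent, period_hours):
    """Return the minutes of downtime a period can absorb at a given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return period_hours * 60 * unavailable_fraction


HOURS_PER_YEAR = 365 * 24   # assumption: a non-leap year
HOURS_PER_MONTH = 30 * 24   # assumption: a 30-day month

yearly_minutes = downtime_budget_minutes(99.5, HOURS_PER_YEAR)
monthly_minutes = downtime_budget_minutes(99.5, HOURS_PER_MONTH)

print(f"Per year:  {yearly_minutes / 60:.1f} hours")
print(f"Per month: {monthly_minutes:.0f} minutes")
```

Under these assumptions, 99.5 percent availability leaves a budget of about 43.8 hours per year, or 216 minutes per 30-day month. An outage lasting a couple of days would use up the whole yearly budget.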

Cloud users want to avoid catastrophic failure of a cloud service, such as the network outage that brought down Amazon's U.S. (Northern Virginia) Region at least twice:

  • In 2007, Amazon's Elastic Compute Cloud (EC2) service was affected.
  • In 2011, Amazon's Cloud Computing PaaS solution, EC2, and Relational Database Service were affected.

Amazon's other two data center regions in the United States continued to run in good health.

At the Northern Virginia region, the storage volumes that Amazon's EC2 service used automatically created backups of themselves, filling up storage capacity. When the automatic backups attempted to store themselves beyond the maximum storage capacity, the outage resulted: the network had nowhere to store more backups.

An outage of this type can cause customers to lose thousands of hours of data. This article shows you how to avoid those types of losses. It describes a cloud failover policy, its cloud-specific riders, and scenarios for halting the damage, fixing the problem, restoring the system, and notifying clients.

SaaS-specific rider

Following are examples of what components and tasks should be included in a SaaS-specific rider for a proactive cloud failover policy.

Components

Due to the limited control a SaaS user has, the SaaS-specific rider contains a minimum of three components:

  • User control
  • User license
  • A failover plan

The only control a SaaS user has is to access the SaaS applications to perform business functions. The user license component, as negotiated with the provider, should specify the maximum number of:

  • SaaS applications to access.
  • Concurrent users to access an application.
  • Requests allocated to each user.

The user license component should also specify the types of SaaS applications the users are allowed to access, such as finance, project management, customer relations, retail management, and even vulnerability scanning with IBM Rational AppScan.
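The license limits can be written down as data and checked automatically. The sketch below is one hypothetical way to do that. The class `SaaSUserLicense`, its field names, and the input dictionaries are all assumptions made for illustration and do not come from any provider's API.

```python
from dataclasses import dataclass, field


@dataclass
class SaaSUserLicense:
    """Hypothetical model of the limits a SaaS user license might specify."""
    max_applications: int
    max_concurrent_users_per_application: int
    max_requests_per_user: int
    allowed_application_types: set = field(default_factory=set)

    def violations(self, application_types, concurrent_users, requests_by_user):
        """Return a list of license violations (empty if there are none).

        application_types: dict of application name -> application type
        concurrent_users:  dict of application name -> current concurrent users
        requests_by_user:  dict of user name -> requests made so far
        """
        problems = []
        if len(application_types) > self.max_applications:
            problems.append("too many applications")
        for name, app_type in application_types.items():
            if app_type not in self.allowed_application_types:
                problems.append(f"application type not licensed: {name}")
        for name, users in concurrent_users.items():
            if users > self.max_concurrent_users_per_application:
                problems.append(f"too many concurrent users: {name}")
        for user, requests in requests_by_user.items():
            if requests > self.max_requests_per_user:
                problems.append(f"request limit exceeded: {user}")
        return problems
```

A rider drafted this way gives both parties a single list of numbers to negotiate, and the same numbers can drive monitoring.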

The failover plan component should state that SaaS application instances are in place to allow failover from one hosted center to another. It should indicate that the provider offers both backup services and service level agreements (SLAs) and complies with data privacy laws such as the Federal Privacy Act and Electronic Documents Act.

Tasks

At a minimum, the SaaS users should be allowed to:

  • Access the SaaS application up to the maximum number of requests allocated to each user.
  • Update records based on the roles the users are assigned.
  • Receive security alerts.

Only the provider can:

  • Purchase a software upgrade license.
  • Manage patches to SaaS applications.
  • Access system applications and virtual machines.
  • Access the traditional computing infrastructure underlying virtual machines.

The provider should specify how, when a SaaS application fails, it can fail over the application from the failing data center to a healthy one in the shortest time possible (no longer than five minutes) and how it can restore the application after fixing the problems.

Let's look at a SaaS failure scenario to illustrate this knowledge.


SaaS failure scenario

A company's on-premises CRM applications were successfully migrated as SaaS applications to an external provider hosting a multitenant environment in data center regions in the United States. A network glitch at one data center then brought down the system for a couple of days: automatic backups of storage data had filled all storage capacity, leaving the network nowhere to store 200 million transactions from all over the world. As you would expect, an upsurge of complaints rolled in:

  • Salespeople howled.
  • Phones at help call centers rang loud and constant for days.
  • The provider groaned when it discovered that it could not get the SaaS CRM applications back in service within two minutes after the failure, as guaranteed in an SLA negotiated with the SaaS users.

Frantic for fast solutions, the provider put up a notice on its website, "Please wait ... we will have the system up soon." It was not soon enough.

It took a couple of days for the provider to get the SaaS applications working. The provider was not able to provide effective failover. Following are the proactive actions the provider can take to halt the damage, fix the problems, restore the system, and notify clients.

Halt the damage

To halt damage, plan ahead by preparing the SaaS applications as instances for automated failover. The SLA negotiated between the SaaS subscriber and the provider should be in place. When performance falls below the guaranteed level of service availability (99.9 percent), an alert should be sent to the provider. Depending on the traffic on the network, performance can then be brought back to the guaranteed level nearly instantaneously.
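A minimal sketch of such an alert check follows. It assumes availability is estimated from a list of periodic health-check results (True for success). The function names, the returned keys, and the sampling approach are illustrative assumptions, not a description of any provider's monitoring service.

```python
def measured_availability(health_checks):
    """Return the percentage of successful health checks."""
    if not health_checks:
        raise ValueError("at least one health check is required")
    return 100.0 * sum(health_checks) / len(health_checks)


def evaluate_sla(health_checks, guaranteed_percent=99.9):
    """Decide whether to alert the provider and notify clients."""
    availability = measured_availability(health_checks)
    breached = availability < guaranteed_percent
    return {
        "availability_percent": availability,
        "alert_provider": breached,   # the provider starts the failover
        "notify_clients": breached,   # clients learn where service continues
    }
```

With 1,000 samples, a single failed check still meets a 99.9 percent guarantee, while two failed checks breach it and raise the alert.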

Meanwhile, SaaS clients are notified that the service continues at another data center while the provider fixes the problems. The provider will let them know when the service is restored at the originating data center.

(Note that there is a strong element of notification in all proactive actions described in these policies: Letting concerned parties know what is going on in your operations can often buy you more time to correct a problem; keeping customers in the dark about a legitimate situation will always work to your disadvantage.)

Fix the problem

The provider can plan ahead for the failover and have a cloud failover policy in place:

  • Install instances of the SaaS application to allow failover from one data center to another.
  • Periodically check to determine if backup tapes are working properly and free of defects.
  • Test to see if network software can fix the problem as it is intended to do.
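Because the scenario's outage began with automatic backups filling all storage capacity, a provider might also add a capacity check before each backup runs. The sketch below is a hypothetical guard. The function name, the gigabyte units, and the 20 percent reserve are assumptions made for illustration.

```python
def can_store_backup(capacity_gb, used_gb, backup_gb, reserve_percent=20):
    """Return True if the backup fits while keeping a reserve of free space."""
    if capacity_gb <= 0:
        raise ValueError("capacity_gb must be positive")
    free_after_backup = capacity_gb - used_gb - backup_gb
    # Compare scaled values so whole-number inputs need no floating point.
    return free_after_backup * 100 >= capacity_gb * reserve_percent
```

When the check returns False, the backup can be redirected to another data center or older backups can be retired first, instead of letting the volume fill.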

Restore the system

The next step is to begin restoring the system at the data center where the system was brought down. The provider sets up a testing environment to test the restored system's resilience to ensure relatively smooth failover to another data center. For example, the provider randomly kills resources and services and notes the results.
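The random-kill test can be sketched as a small simulation. The replica layout, the function names, and the rule that a service is down only when every replica is gone are assumptions for illustration. A real test would stop actual resources in the testing environment.

```python
import random


def kill_random_resources(replicas, kill_count, rng):
    """Remove kill_count randomly chosen replicas and return the survivors.

    replicas: dict of service name -> list of replica names
    rng:      a random.Random instance (seed it for repeatable test runs)
    """
    all_replicas = [(service, replica)
                    for service, names in replicas.items()
                    for replica in names]
    killed = set(rng.sample(all_replicas, kill_count))
    return {service: [r for r in names if (service, r) not in killed]
            for service, names in replicas.items()}


def services_down(surviving):
    """Return the services that have no replica left."""
    return sorted(service for service, names in surviving.items() if not names)
```

With two replicas per service, killing any single replica should leave `services_down` empty. If it does not, the restored system is not yet ready for failover.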

When the testing is done, the provider backs up a copy of the restored system before moving it to a production environment.

Notify clients

As soon as the SaaS applications are restored in the data center that was brought down, the provider notifies its clients that the restoration is complete and that the terms of the SLA (free credit, reimbursements, an opportunity to terminate) will be carried out.


PaaS-specific rider

What components and tasks should be included in a PaaS-specific rider for a cloud failover policy?

Components

PaaS developers have more control, so the PaaS-specific rider should contain a minimum of one more component than the SaaS rider: the ability to develop applications. The components for a PaaS developer are user control, the developer license, application development (not available to a SaaS user), and a failover plan.

A PaaS developer controls the development of SaaS applications and all the applications found in a full business life cycle; for example, spreadsheets, word processors, and backups.

The developer license component, as negotiated with the provider, should specify the maximum number of:

  • Applications to develop.
  • Concurrent application developers allowed.
  • SaaS users allowed to access the application on the PaaS.

The application development component should specify the types of applications to develop, such as:

  • SaaS applications
  • Websites
  • CRM applications
  • Mobile applications
  • Billing and payroll applications
  • Service Delivery Platform applications (like mobile TV)
  • Content delivery management apps

A failover plan component should specify that PaaS application instances are in place to allow failover from one hosted center to another. The component should specify whether a PaaS developer would use her own or the provider's failover application, and whether or not she would be able to employ third-party tools for tasks such as load balancing. The plan should also indicate whether the provider offers both backup services and SLAs and whether it complies with data privacy laws.

Tasks

PaaS developers are allowed to:

  • Build, deploy, and run applications.
  • Manage patches and upgrades.
  • Determine which SaaS users can access SaaS applications co-existing on the PaaS.
  • Flexibly customize their platforms to react to local market conditions.

They are also allowed to:

  • Scan applications for vulnerabilities.
  • Configure application firewalls.
  • Build and monitor security alerts.
  • Perform backup operations.

At a minimum, only the provider can:

  • Run system applications.
  • Run virtual machines.
  • Access the traditional computing infrastructure underlying virtual machines.

Let's look at a PaaS failure scenario to illustrate this knowledge.


PaaS failure scenario

A PaaS developer's "resource optimization" application fails. The failure causes this platform and other PaaS platforms hosted by the same provider to shut down completely. It could have been caused by failure to create replicated service instances that can survive host failures. Unfortunately, the "resource optimization" application neither identified failures nor implemented short timeouts and quick retries.

Following are the proactive actions that can be taken, by implementing the terms of the PaaS-specific rider negotiated with the provider, to halt the damage, fix the problems, restore the system, and notify clients.

Halt the damage

To halt damage, the PaaS developer builds simple services composed of a single host rather than multiple dependent hosts. This allows the developer to create replicated service instances that can survive host failures.

Fix the problem

When failures happen, the developer's software should quickly identify those failures and begin the failover process.
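One common way to identify failures quickly is a heartbeat check. The sketch below assumes each host reports a timestamp periodically and treats a host as failed when its last report is older than a timeout. The function name, the five-second timeout, and the host names in the usage note are hypothetical.

```python
def failed_hosts(last_heartbeat, now, timeout_seconds=5.0):
    """Return hosts whose last heartbeat is older than timeout_seconds.

    last_heartbeat: dict of host name -> time of last heartbeat (seconds)
    now:            current time on the same clock (seconds)
    """
    return sorted(host for host, seen in last_heartbeat.items()
                  if now - seen > timeout_seconds)
```

For example, with heartbeats `{"host-a": 100.0, "host-b": 93.0}` and `now=101.0`, only `host-b` is reported, and the failover process can begin for the services it was running.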

Suppose the developer has a billing application that consists of business logic components A, B, and C. He can compose service groups like this:

(A1, B1, C1), (A2, B2, C2) ... (An, Bn, Cn)

where n is the number of service group categories:

        for service category 1,
                A1 is the logic to find service code
                B1 is the logic to insert service category into the bill
                C1 is the logic to check the accuracy of zip codes

        for service category 2,
                A2 is the logic to find service code
                B2 is the logic to insert service category into the bill
                C2 is the logic to check the accuracy of zip codes

        for service category n,
                An is the logic to find service code
                Bn is the logic to insert service category into the bill
                Cn is the logic to check the accuracy of zip codes

If a single virtual machine running the PaaS fails, the failure results in the loss of the entire service group. This means that if component A1 fails in one service group, the other two components, B1 and C1, fail with it. If there is more than one service group, the entire system fails.

To fix the problem, the developer decomposes the components into independent pools like this:

(A1, A2, ... An), (B1, B2, ... Bn), (C1, C2, ... Cn)

This lets him create multiple redundant service copies at healthy data centers. If component A1 fails or slows down, the other components A2 ... An in the same independent pool do not fail. Neither do the second pool of components (B1, B2, ... Bn) or the third pool (C1, C2, ... Cn).
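The difference between the two layouts can be shown with a small simulation. It uses a deliberately simplified model in which a failed component disables only its own service group, and the function names are hypothetical.

```python
def usable_in_groups(n, failed):
    """Components still usable when (Ai, Bi, Ci) are bound into service groups."""
    usable = []
    for i in range(1, n + 1):
        group = [f"A{i}", f"B{i}", f"C{i}"]
        # One failed member takes the whole group out of service.
        if not any(component in failed for component in group):
            usable.extend(group)
    return usable


def usable_in_pools(n, failed):
    """Components still usable when each type lives in its own independent pool."""
    usable = []
    for kind in "ABC":
        # Only the failed component itself is lost from its pool.
        usable.extend(f"{kind}{i}" for i in range(1, n + 1)
                      if f"{kind}{i}" not in failed)
    return usable


print(usable_in_groups(2, {"A1"}))
print(usable_in_pools(2, {"A1"}))
```

With two service categories and A1 failed, the grouped layout keeps only A2, B2, and C2, while the pooled layout keeps A2 along with B1, B2, C1, and C2.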

The developer uses quick timeouts and retries of the slow services while the failover of all independent service pools of the billing application is in progress. The developer needs to determine when to halt timeouts and retries to avoid system lockup resulting from consumption of all resources waiting on slow or failed services.
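A minimal sketch of quick timeouts with bounded retries follows. It assumes the called service accepts a timeout in seconds and raises `TimeoutError` when it is too slow. The function name `call_with_retries` and the default limits are illustrative assumptions. The attempt limit and the overall time budget are what stop retries from consuming all resources.

```python
import time


def call_with_retries(operation, timeout_seconds=0.5, max_attempts=3,
                      total_budget_seconds=2.0, clock=time.monotonic):
    """Call operation(timeout) with short timeouts and a bounded number of retries.

    operation must accept a timeout in seconds and raise TimeoutError when slow.
    Raises RuntimeError once the attempts or the time budget are used up.
    """
    deadline = clock() + total_budget_seconds
    last_error = None
    for _ in range(max_attempts):
        remaining = deadline - clock()
        if remaining <= 0:
            break  # budget spent: stop waiting on the slow service
        try:
            return operation(min(timeout_seconds, remaining))
        except TimeoutError as error:
            last_error = error  # retry quickly instead of blocking
    raise RuntimeError("service unavailable; stop retrying") from last_error
```

The caller treats the final `RuntimeError` as the signal to rely on the failover copies rather than keep waiting.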

Restore the system

The next step is to begin restoring the PaaS platform at the data center where the system was brought down. In a testing environment, the provider randomly halts resources and services underlying the PaaS. The developer tests the applications under varying conditions to test the PaaS's resilience. When the testing is done, the provider backs up the restored system before it is moved to a production environment.

Notify clients

As soon as the PaaS applications are running at the restored data center, the provider notifies PaaS developers that the system is restored, and the terms of the SLA are carried out.


IaaS-specific rider

Finally, what components and tasks should be included in an IaaS-specific rider for a cloud failover policy?

Components

Since IaaS is the foundation of the PaaS level, the IaaS-specific rider should contain a different set of components. Besides user control, enterprise license, and a failover plan, you'll find a component for virtual machines.

An IaaS specialist controls virtual machines; PaaS developers and SaaS users do not.

The enterprise license should specify the maximum number of:

  • Virtual machines to develop, run, and maintain.
  • IaaS specialists to concurrently access the virtual machines.
  • PaaS developers to work on top of the IaaS virtual machines.

A failover plan should allow failover of IaaS virtual machines from one hosted center to another. It should allow IaaS infrastructure and network specialists to work with PaaS developers to set up failover virtual machines.

Tasks

The IaaS specialist can:

  • Develop, manage, and access virtual machines.
  • Authorize PaaS developers to develop applications on virtual machines.
  • Use third-party tools to increase performance (for example, a load-balancer) and protect system data.
  • Scan virtual machines for vulnerabilities.

These specialists are also allowed to:

  • Configure virtual machine firewalls.
  • Build and monitor security alerts.
  • Back up virtual machines.

Only the provider can access the infrastructure of traditional computing resources underlying the virtual machines. The IaaS specialist cannot.

Let's look at an IaaS failure scenario to illustrate this knowledge.


IaaS failure scenario

Virtual machines fail due to lack of additional resources needed for consumption at high I/O points. In other words, no virtual instances exist to allow failover to a healthy data center.

The provider can take the following proactive actions.

Halt the damage

To halt damage, plan how many resources are needed to run virtual machines at high I/O points. This can be done through capacity planning and performance threshold policies, for example. (There are other, more detailed articles in the Resources section that deal with threshold policies and capacity planning.)

Fix the problem

To fix the problem, establish performance threshold policies to determine when to provision the virtual machines needed during high I/O periods and how to ensure all of the service activation points are satisfied. Make sure virtual machine instances are in place at the data centers the hosting provider controls.
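A performance threshold policy of this kind can be reduced to a small calculation. The sketch below assumes load is measured in I/O operations per second (IOPS) and that each virtual machine should run at or below a threshold percentage of its rated capacity. The metric, the 80 percent threshold, and the function names are all assumptions made for illustration.

```python
def vms_needed(expected_iops, iops_per_vm, threshold_percent=80):
    """Return the VMs needed so that none runs above the threshold."""
    if iops_per_vm <= 0 or threshold_percent <= 0:
        raise ValueError("iops_per_vm and threshold_percent must be positive")
    # Integer ceiling division: expected / (per_vm * threshold / 100), rounded up.
    return -(-(expected_iops * 100) // (iops_per_vm * threshold_percent))


def vms_to_provision(running_vms, expected_iops, iops_per_vm, threshold_percent=80):
    """Return how many additional VMs to provision before the peak arrives."""
    needed = vms_needed(expected_iops, iops_per_vm, threshold_percent)
    return max(0, needed - running_vms)
```

For instance, with four virtual machines rated at 1,000 IOPS each and an expected peak of 4,000 IOPS, an 80 percent threshold calls for five machines, so one more should be provisioned.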

Also, plan ahead so that failures are quickly identified when they happen and the provider transfers the IaaS-based virtual machine instances to another data center.
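The transfer step can be sketched as a simple placement plan. The example assumes the provider knows how many free virtual machine slots each data center has. The function name, the data center names in the usage note, and the "most free slots first" rule are hypothetical choices, not a description of any provider's scheduler.

```python
def plan_transfer(failed_center, instances, free_slots):
    """Assign each instance from the failed center to a healthy data center.

    instances:  list of virtual machine names to move
    free_slots: dict of data center name -> free virtual machine slots
    """
    remaining = {center: slots for center, slots in free_slots.items()
                 if center != failed_center}
    plan = {}
    for vm in instances:
        # Pick the center with the most free slots; ties go to the first name.
        target = max(sorted(remaining), key=lambda c: remaining[c], default=None)
        if target is None or remaining[target] == 0:
            raise RuntimeError(f"no capacity left for {vm}")
        plan[vm] = target
        remaining[target] -= 1
    return plan
```

Given free slots of `{"east": 0, "west": 2, "central": 1}` and a failure at `east`, three instances are spread across `west` and `central`. A fourth instance would raise an error, which is the situation the scenario describes when no failover instances exist.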

Restore the system

The next step is to begin restoring the IaaS platform at the data center where the system was brought down. In a testing environment, the provider can randomly halt resources and services underlying the IaaS. After testing is done, the provider backs up the restored system before it is moved to a production environment.

Notify clients

As soon as the virtual machines are running at the restored data center, the provider notifies IaaS specialists of the complete restoration of virtual machines at the data center.


In conclusion

The anatomy of an effective cloud failover policy requires proactive policy planning to:

  • Determine why cloud failures happen.
  • Identify those failures.
  • Prepare cloud-specific riders in the policy to proactively counteract the variables responsible for those failures.

Your team of developers, users, and providers needs to work together to draw up the policy and riders. The team will find that resolving which elements, components, and tasks to include makes the job of crafting the policy much easier.

In this article, I've covered the basic components and tasks that should be considered as part of a failover policy in each cloud computing model — SaaS, PaaS, and IaaS. I've also provided suggestions on how the provider can proceed to halt the damage, fix the problem, restore the system, and of course, notify the client, for each model.

Other articles (in Resources) dive deeper into defining the policy elements of purpose, scope, background, actions and constraints that will help you draft cloud computing policies to deal with threshold triggers, workload balancing, general security, mobile access, application migration, service risk management, performance metrics, and more. In all the policy instructions I've provided, reliability and security are always included as key underlying components.
