The goal of a cloud failover policy is to ensure cloud service is available at least 99.5 percent of the time. It should focus on how a Software as a Service (SaaS) application, Platform as a Service (PaaS) platform, or Infrastructure as a Service (IaaS) virtual machines running in a cloud service can be transferred from a failing data center to a healthy one in order to provide nearly uninterrupted service to cloud customers.
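To put that target in concrete terms, the arithmetic below converts an availability percentage into a downtime budget. It is only a sketch: the function name and the choice of a 365-day year and a 30-day month are assumptions made for illustration, not part of any provider's SLA.

```python
def allowed_downtime_hours(availability_percent: float, period_hours: float) -> float:
    """Return the hours of downtime permitted in a period at a given availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period_hours * (1 - availability_percent / 100)


HOURS_PER_YEAR = 365 * 24          # assumption: ignore leap years
HOURS_PER_30_DAY_MONTH = 30 * 24   # assumption: a 30-day billing month

yearly = allowed_downtime_hours(99.5, HOURS_PER_YEAR)
monthly = allowed_downtime_hours(99.5, HOURS_PER_30_DAY_MONTH)
print(f"99.5% allows about {yearly:.1f} hours of downtime per year")
print(f"99.5% allows about {monthly:.1f} hours of downtime per 30-day month")
```

At 99.5 percent, the budget works out to roughly 43.8 hours a year, which is why an outage lasting a couple of days exceeds the target on its own.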
Cloud users want to avoid catastrophic failure of a cloud service, such as the network outage that brought down Amazon's U.S. (Northern Virginia) Region at least twice:
- In 2007, Amazon's Elastic Compute Cloud (EC2) Service was affected.
- In 2011, Amazon's Cloud Computing PaaS solution, EC2, and Relational Database Service were affected.
Amazon's other two data center regions in the United States continued to run in good health.
At the Northern Virginia region, the storage volumes that Amazon's EC2 service used automatically created backups of themselves, filling up storage capacity. When the automatic backups attempted to store themselves beyond the maximum limit of the storage capacity, the outage resulted. The network had nowhere to store more backups.
An outage of this type can cause customers to lose thousands of hours of data. This article shows you how to avoid those types of losses. It describes a cloud failover policy, its cloud-specific riders, and scenarios for halting the damage, fixing the problem, restoring the system, and notifying clients.
Following are examples of what components and tasks should be included in a SaaS-specific rider for a proactive cloud failover policy.
Due to the limited control a SaaS user has, the SaaS-specific rider contains a minimum of three components:
- User control
- User license
- A failover plan
The only control a SaaS user has is to access the SaaS applications to perform business functions. The user license component, as negotiated with the provider, should specify the maximum number of:
- SaaS applications to access.
- Concurrent users to access an application.
- Requests allocated to each user.
The user license component should also specify the types of SaaS applications the users are allowed to access, such as finance, project management, customer relations, retail management, and even vulnerability scanning with IBM Rational AppScan.
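The license limits above can be expressed as a small data structure that an access check consults. This is a minimal sketch, not a real licensing API: the class `SaaSUserLicense`, its fields, and the application names are all hypothetical, and the set of allowed applications stands in for both the types and the maximum number of applications.

```python
from dataclasses import dataclass, field


@dataclass
class SaaSUserLicense:
    """Hypothetical limits negotiated with the provider."""
    allowed_applications: set[str]
    max_concurrent_users: int
    max_requests_per_user: int
    # Bookkeeping: which users are active per application, and requests used per user.
    active_users: dict[str, set[str]] = field(default_factory=dict)
    requests_made: dict[str, int] = field(default_factory=dict)

    def authorize(self, user: str, application: str) -> tuple[bool, str]:
        """Check one request against the license and record it if it is allowed."""
        if application not in self.allowed_applications:
            return False, "application not covered by license"
        users = self.active_users.setdefault(application, set())
        if user not in users and len(users) >= self.max_concurrent_users:
            return False, "concurrent user limit reached"
        if self.requests_made.get(user, 0) >= self.max_requests_per_user:
            return False, "request allocation exhausted"
        users.add(user)
        self.requests_made[user] = self.requests_made.get(user, 0) + 1
        return True, "ok"


lic = SaaSUserLicense(
    allowed_applications={"crm", "project-management"},
    max_concurrent_users=2,
    max_requests_per_user=3,
)
print(lic.authorize("ana", "crm"))
print(lic.authorize("ana", "payroll"))
```

A real provider would enforce these limits on its side; the sketch only shows how the three negotiated maximums translate into checks.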
The failover plan component should state that SaaS application instances are in place to allow failover from one hosted center to another. It should indicate that the provider offers both backup services and service level agreements (SLAs) and complies with data privacy laws such as the Federal Privacy Act and Electronic Documents Act.
At a minimum, the SaaS users should be allowed to:
- Access the SaaS application up to the maximum number of accesses allowed per user.
- Update records based on the roles the users are assigned.
- Receive security alerts.
Only the provider can:
- Purchase a software upgrade license.
- Manage patches to SaaS applications.
- Access system applications and virtual machines.
- Access the traditional computing infrastructure underlying virtual machines.
The provider should specify how, when a SaaS application fails, it can fail over the application from the failing data center to a healthy one in the shortest time possible (no longer than five minutes) and how it can restore the application after fixing the problems.
Let's look at a SaaS failure scenario to illustrate this knowledge.
A company's on-premises CRM applications were successfully migrated as SaaS applications to an external provider hosting a multitenant environment in data center regions in the United States. A network glitch at one data center then brought down the system for a couple of days: automatic backups of storage data filled up all storage capacity, leaving the network nowhere to store 200 million transactions from all over the world. As you would expect, an upsurge of complaints rolled in:
- Salespeople howled.
- Phones at help call centers rang loud and constant for days.
- The provider groaned (when it was discovered that it could not get the SaaS CRM applications back in service within two minutes after the failure, as guaranteed in an SLA negotiated with the SaaS users).
Frantic for fast solutions, the provider put up a notice on its website, "Please wait ... we will have the system up soon." It was not soon enough.
It took a couple of days for the provider to get the SaaS applications working. The provider was not able to provide effective failover. Following are the proactive actions the provider can take to halt the damage, fix the problems, restore the system, and notify clients.
To halt damage, plan ahead by preparing the SaaS applications as instances for automated failover. The SLA negotiated between the SaaS subscriber and the provider should be in place. When performance falls far below the guaranteed level of service availability (99.9 percent), an alert should be sent to the provider. Depending on the traffic on the network, performance that has fallen below that level may be brought back to the guaranteed level nearly instantaneously.
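One way to picture the alert is as a comparison between measured availability and the guaranteed level. The sketch below assumes availability is estimated from periodic health probes, each recorded as success or failure; the function names and the probe-based measurement are illustrative assumptions, not a description of any provider's monitoring.

```python
from typing import Optional


def measured_availability(probe_results: list[bool]) -> float:
    """Percentage of health probes that succeeded."""
    if not probe_results:
        raise ValueError("no probe results to evaluate")
    return 100.0 * sum(probe_results) / len(probe_results)


def check_sla(probe_results: list[bool], guaranteed_percent: float = 99.9) -> Optional[str]:
    """Return an alert message if availability is below the guarantee, else None."""
    availability = measured_availability(probe_results)
    if availability < guaranteed_percent:
        return (f"ALERT: availability {availability:.2f}% is below "
                f"the guaranteed {guaranteed_percent}%")
    return None


# Hypothetical sample: 1,000 probes, 2 of which failed.
probes = [True] * 998 + [False] * 2
print(check_sla(probes))
```

In practice, the alert would go to the provider's operations staff and trigger the automated failover prepared in advance.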
Meanwhile, SaaS clients are notified that service continues at another data center while the provider fixes the problems. The provider will let them know when the service is restored at the originating data center.
(Note that there is a strong element of notification in all proactive actions described in these policies: Letting concerned parties know what is going on in your operations can often buy you more time to correct a problem; keeping customers in the dark about a legitimate situation will always work to your disadvantage.)
The provider can plan ahead for the failover and have a cloud failover policy in place:
- Install instances of the SaaS application to allow failover from one data center to another.
- Periodically check to determine if backup tapes are working properly and free of defects.
- Test to see if network software can fix the problem as it is intended to do.
The next step is to begin restoring the system at the data center where the system was brought down. The provider sets up a testing environment to test the restored system's resilience to ensure relatively smooth failover to another data center. For example, the provider randomly kills resources and services and notes the results.
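The random-kill test can be rehearsed as a simulation before it is tried on real resources. The sketch below is a toy model under stated assumptions: the inventory format (a service name mapped to the data centers hosting a copy), the function `run_kill_test`, and the data center names are all hypothetical, and nothing is actually stopped.

```python
import random


def run_kill_test(instances: dict[str, list[str]], rounds: int, seed: int = 0) -> list[dict]:
    """Randomly stop one instance per round and note whether its service survives.

    `instances` maps a service name to the data centers that host a running copy.
    """
    rng = random.Random(seed)  # seeded so a test run can be repeated
    # Work on a copy so the caller's inventory is left untouched.
    running = {service: list(centers) for service, centers in instances.items()}
    results = []
    for round_number in range(1, rounds + 1):
        candidates = [(s, c) for s, centers in running.items() for c in centers]
        if not candidates:
            break  # nothing left to stop
        service, center = rng.choice(candidates)
        running[service].remove(center)
        results.append({
            "round": round_number,
            "killed": f"{service}@{center}",
            "service_survived": len(running[service]) > 0,
        })
    return results


inventory = {"crm": ["dc-east", "dc-west"], "billing": ["dc-east"]}
for note in run_kill_test(inventory, rounds=3):
    print(note)
```

A service with a single copy, like the hypothetical `billing` entry, fails the first time its instance is stopped, which is exactly the kind of result the provider wants to note before production.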
When the testing is done, the provider backs up a copy of the restored system before moving it to a production environment.
As soon as the SaaS applications are restored in the data center that was brought down, the provider notifies its clients that the restoration is complete and that the terms of the SLA (free credit, reimbursements, an opportunity to terminate) will be carried out.
What components and tasks should be included in a PaaS-specific rider for a cloud failover policy?
PaaS developers have more control, so the PaaS-specific rider should contain a minimum of one more component than the SaaS rider: the ability to develop applications. The components for a PaaS developer are user control, the developer license, application development (not available to a SaaS user), and a failover plan.
A PaaS developer controls the development of SaaS applications and all the applications found in a full business life cycle; for example, spreadsheets, word processors, and backups.
The developer license component, as negotiated with the provider, should specify the maximum number of:
- Applications to develop.
- Concurrent application developers allowed.
- SaaS users allowed to access the applications on the PaaS.
The application development component should specify the types of applications to develop, such as:
- SaaS applications
- CRM applications
- Mobile applications
- Billing and payroll applications
- Service Delivery Platform applications (like mobile TV)
- Content delivery management apps
A failover plan component should specify that PaaS application instances are in place to allow failover from one hosted center to another. The component should specify whether a PaaS developer would use her own or the provider's failover application, and whether or not she would be able to employ third-party tools for tasks such as load balancing. The plan should also indicate whether the provider offers both backup services and SLAs and whether it complies with data privacy laws.
PaaS developers are allowed to:
- Build, deploy, and run applications.
- Manage patches and upgrades.
- Determine which SaaS users can access SaaS applications co-existing on the PaaS.
- Flexibly customize their platforms to react to local market conditions.
They are also allowed to:
- Scan applications for vulnerabilities.
- Configure application firewalls.
- Build and monitor security alerts.
- Perform backup operations.
At a minimum, only the provider can:
- Run system applications.
- Run virtual machines.
- Access the traditional computing infrastructure underlying virtual machines.
Let's look at a PaaS failure scenario to illustrate this knowledge.
A PaaS developer's "resource optimization" application fails. The failure causes this platform and other PaaS platforms hosted by the same provider to shut down completely. It could have been caused by a failure to create replicated service instances that can survive host failures. Unfortunately, the "resource optimization" application neither identified failures nor implemented short timeouts and quick retries.
Following are the proactive actions that can be taken, under the terms of the PaaS-specific rider negotiated with the provider, to halt the damage, fix the problems, restore the system, and notify clients.
To halt damage, the PaaS developer builds simple services composed of a single host rather than multiple dependent hosts. This allows the developer to create replicated service instances that can survive host failures.
When failures happen, the developer's software should quickly identify those failures and begin the failover process.
Suppose the developer has a billing application that consists of business logic components A, B, and C. He can compose service groups like this:
(A1, B1, C1), (A2, B2, C2) ... (An, Bn, Cn)
Where n is the number of the component type, representing the number of a service group category:
- For service category 1, A1 is the logic to find the service code, B1 is the logic to insert the service category into the bill, and C1 is the logic to check the accuracy of zip codes.
- For service category 2, A2 is the logic to find the service code, B2 is the logic to insert the service category into the bill, and C2 is the logic to check the accuracy of zip codes.
- For service category n, An is the logic to find the service code, Bn is the logic to insert the service category into the bill, and Cn is the logic to check the accuracy of zip codes.
If a single virtual machine running the PaaS fails, the failure results in the loss of the entire service group. This means that if component A1 fails in one service group, the other two components, B1 and C1, fail as well. If there is more than one service group, the entire system fails.
To fix the problem, the developer decomposes the components into independent pools like this:
(A1, A2, ... An), (B1, B2, ... Bn), (C1, C2, ... Cn)
This lets him create multiple redundant service copies at healthy data centers. If component A1 fails or slows down, the other components A2 ... An in the same independent pool do not fail. The second independent pool of components B1, B2 ... Bn and the third pool of components C1, C2 ... Cn do not fail either.
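The difference between the two layouts can be checked with a small model. The sketch below is an illustration under simplifying assumptions: a coupled group is treated as usable only when all three of its components are healthy, a pool is treated as usable while any replica remains, and the function names are hypothetical.

```python
def healthy_groups(groups, failed):
    """Coupled design: a group is usable only if all of its components are healthy."""
    return [g for g in groups if not any(c in failed for c in g)]


def healthy_pools(pools, failed):
    """Pooled design: each pool keeps whichever of its replicas are still healthy."""
    return {kind: [c for c in members if c not in failed]
            for kind, members in pools.items()}


def can_bill(pools, failed):
    """Billing can proceed if every pool (A, B, and C) has at least one healthy replica."""
    return all(healthy_pools(pools, failed).values())


n = 3
groups = [(f"A{i}", f"B{i}", f"C{i}") for i in range(1, n + 1)]
pools = {kind: [f"{kind}{i}" for i in range(1, n + 1)] for kind in "ABC"}

failed = {"A1", "B2", "C3"}  # one failure in each service category
print(healthy_groups(groups, failed))  # no complete group remains
print(healthy_pools(pools, failed))    # each pool still has two replicas
print(can_bill(pools, failed))
```

With the same three failures, the coupled layout loses every group, while each independent pool still has replicas available to serve requests.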
The developer uses quick timeouts and retries of the slow services while the failover of all independent service pools of the billing application is in progress. The developer needs to determine when to halt timeouts and retries to avoid system lockup resulting from consumption of all resources waiting on slow or failed services.
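A bounded retry loop is one common way to combine short timeouts with a limit on how long callers wait. The sketch below assumes the slow service is wrapped in a callable that raises `TimeoutError` when it exceeds its timeout; `call_with_retries`, `RetryBudgetExceeded`, and the flaky stub are hypothetical names used only for illustration.

```python
import time


class RetryBudgetExceeded(Exception):
    """Raised when the caller stops retrying to avoid tying up resources."""


def call_with_retries(operation, timeout_s, max_attempts, backoff_s=0.0, sleep=time.sleep):
    """Call operation(timeout_s) until it succeeds or the attempts run out.

    `operation` is expected to raise TimeoutError when it exceeds timeout_s.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(timeout_s)
        except TimeoutError as error:
            last_error = error
            if attempt < max_attempts:
                sleep(backoff_s)  # brief pause before the next quick retry
    # Halt here instead of retrying forever on a slow or failed service.
    raise RetryBudgetExceeded(f"gave up after {max_attempts} attempts") from last_error


def make_flaky(failures_before_success):
    """Build a stand-in service that times out a fixed number of times, then succeeds."""
    calls = {"count": 0}

    def operation(timeout_s):
        calls["count"] += 1
        if calls["count"] <= failures_before_success:
            raise TimeoutError(f"no reply within {timeout_s}s")
        return "bill posted"

    return operation


print(call_with_retries(make_flaky(2), timeout_s=0.2, max_attempts=3))
```

The `max_attempts` cap is the point at which retries halt; choosing it, along with the timeout, is the judgment call the developer has to make for each service.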
The next step is to begin restoring the PaaS platform at the data center where the system was brought down. In a testing environment, the provider randomly halts resources and services underlying the PaaS. The developer tests the applications under varying conditions to test the PaaS's resilience. When the testing is done, the provider backs up the restored system before it is moved to a production environment.
As soon as the PaaS applications are running at the restored data center, the provider notifies PaaS developers that the system has been restored, and the terms of the SLA are carried out.
Finally, what components and tasks should be included in an IaaS-specific rider for a cloud failover policy?
Since IaaS is the foundation of the PaaS level, the IaaS-specific rider should contain a different set of components. Besides user control, enterprise license, and a failover plan, you'll find a component for virtual machines.
An IaaS specialist controls virtual machines; PaaS developers and SaaS users do not.
The enterprise license should specify the maximum number of:
- Virtual machines to develop, run, and maintain.
- IaaS specialists to concurrently access the virtual machines.
- PaaS developers to work on top of the IaaS virtual machines.
A failover plan should allow failover of IaaS virtual machines from one hosted center to another. It should allow IaaS infrastructure and network specialists to work with PaaS developers to set up failover virtual machines.
The IaaS specialist can:
- Develop, manage, and access virtual machines.
- Authorize PaaS developers to develop applications on virtual machines.
- Use third-party tools to increase performance (for example, a load-balancer) and protect system data.
- Scan virtual machines for vulnerabilities.
These specialists are also allowed to:
- Configure virtual machine firewalls.
- Build and monitor security alerts.
- Back up virtual machines.
Only the provider can access the infrastructure of traditional computing resources underlying the virtual machines. The IaaS specialist cannot.
Let's look at an IaaS failure scenario to illustrate this knowledge.
Virtual machines fail due to lack of additional resources needed for consumption at high I/O points. In other words, no virtual instances exist to allow failover to a healthy data center.
The provider can take the following proactive actions.
To halt damage, plan how many resources are needed to run virtual machines at high I/O points. This can be done through capacity planning and performance threshold policies, for example. (There are other, more detailed articles in the Resources section that deal with threshold policies and capacity planning.)
To fix the problem, establish performance threshold policies to determine when to provision the virtual machines needed during high I/O periods and how to ensure all of the service activation points are satisfied. Make sure your virtual machine instances at the data centers the host controls are in place.
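A performance threshold policy of this kind reduces to a sizing rule. The sketch below assumes, purely for illustration, that I/O load is measured in operations per second, that each virtual machine has a known I/O capacity, and that the policy keeps every machine at or below 80 percent of that capacity; the function names and figures are hypothetical.

```python
import math


def vms_needed(current_iops: float, iops_per_vm: float, threshold: float = 0.8) -> int:
    """Number of VMs needed so each runs at or below `threshold` of its I/O capacity."""
    if iops_per_vm <= 0:
        raise ValueError("iops_per_vm must be positive")
    if not 0 < threshold <= 1:
        raise ValueError("threshold must be in the range (0, 1]")
    return max(1, math.ceil(current_iops / (iops_per_vm * threshold)))


def additional_vms(current_iops: float, running_vms: int,
                   iops_per_vm: float, threshold: float = 0.8) -> int:
    """How many more VMs to provision; 0 if the running pool already meets the policy."""
    return max(0, vms_needed(current_iops, iops_per_vm, threshold) - running_vms)


# Hypothetical peak: 9,000 IOPS against 10 running VMs rated at 1,000 IOPS each.
print(vms_needed(9000, 1000))
print(additional_vms(9000, running_vms=10, iops_per_vm=1000))
```

Capacity planning supplies the inputs (expected peak load and per-machine capacity); the threshold policy decides when the extra machines are provisioned.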
Also, plan so that when failures happen, they are quickly identified and the provider transfers the IaaS-based virtual machine instances to another data center.
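Quick identification is often modeled as a run of missed heartbeats. The sketch below assumes each data center reports a heartbeat history ordered from oldest to newest and that three consecutive misses count as a failure; both the rule and the names `is_failed` and `place_vms` are assumptions for illustration, not a provider's actual detection logic.

```python
def is_failed(heartbeats: list[bool], max_missed: int = 3) -> bool:
    """Treat a data center as failed after `max_missed` consecutive missed heartbeats."""
    missed = 0
    for beat in reversed(heartbeats):  # newest heartbeat first
        if beat:
            break
        missed += 1
    return missed >= max_missed


def place_vms(heartbeats_by_center: dict[str, list[bool]], current: str,
              max_missed: int = 3) -> str:
    """Return the data center where the virtual machine instances should run."""
    if not is_failed(heartbeats_by_center[current], max_missed):
        return current  # no failover needed
    for center, beats in heartbeats_by_center.items():
        if center != current and not is_failed(beats, max_missed):
            return center  # first healthy alternative
    raise RuntimeError("no healthy data center available for failover")


history = {
    "dc-east": [True, False, False, False],  # three misses in a row
    "dc-west": [True, True, True, True],
}
print(place_vms(history, current="dc-east"))
```

The failover only works if virtual machine instances are already in place at the target data center, which is why the rider asks for them ahead of time.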
The next step is to begin restoring the IaaS platform at the data center where the system was brought down. In a testing environment, the provider can randomly halt resources and services underlying the IaaS. After testing is done, the provider backs up the restored system before it is moved to a production environment.
As soon as the virtual machines are running at the restored data center, the provider notifies IaaS specialists of the complete restoration of virtual machines at the data center.
The anatomy of an effective cloud failover policy requires proactive policy planning to:
- Determine why cloud failures happen.
- Identify those failures.
- Prepare cloud-specific riders in the policy to proactively counteract the variables responsible for those failures.
Your team of developers, users, and providers needs to work together to draw up the policy and riders. The team will find that resolving the issues of what elements, components, and tasks should be included makes the job of crafting the policy much easier.
In this article, I've covered the basic components and tasks that should be considered as part of a failover policy in each cloud computing model — SaaS, PaaS, and IaaS. I've also provided suggestions on how the provider can proceed to halt the damage, fix the problem, restore the system, and of course, notify the client, for each model.
Other articles (in Resources) dive deeper into defining the policy elements of purpose, scope, background, actions and constraints that will help you draft cloud computing policies to deal with threshold triggers, workload balancing, general security, mobile access, application migration, service risk management, performance metrics, and more. In all the policy instructions I've provided, reliability and security are always included as key underlying components.
Learn more about capacity planning for cloud implementation in "Cloud success secret: Flexible capacity planning."
These articles on policy creation from the author will take you to the next level:
- "Craft security policy for mobile devices." (Security, mobile)
- "Balance workload in a cloud environment: Use threshold policies to dynamically balance workload demands." (Threshold, workload)
- "Cloud computing versus grid computing: Service types, similarities and differences, and things to consider." (More threshold policies)
- "Build proactive threshold policies on the cloud." (Threshold)
- "Change app behavior: From in house to the cloud." (App migration)
- "Cloud services: Mitigate risks, maintain availability." (Service risk mitigation, reliability)
- "Craft a cloud performance metrics policy." (Performance metrics, testing, monitoring)
Judith M. Myerson is a systems architect and engineer. Her areas of interest include enterprise-wide systems, middleware technologies, database technologies, cloud computing, threshold policies, industries, network management, security, RFID technologies, presentation management, and project management.