On a clear day in cloud land (an island), all citizens — users, developers and infrastructure specialists — are happy. The systems run smoothly without issues. The users use remote desktop control to connect to their virtual desktops. Once connected, they access Software as a Service (SaaS) applications and receive quick responses. All virtual desktops have been created with an image of the physical desktop deployed to a server running a hypervisor.
The developers have all the resources they need to develop applications on the Platform as a Service (PaaS). They use virtual desktops to access the platform. The infrastructure specialists access the Infrastructure as a Service (IaaS) to manage virtual machines running on the same physical host.
Then one day the islanders see dark clouds on the horizon. The slow-moving clouds get bigger and bigger as they approach the island. Once they arrive, they completely cover the island's sky. The next thing the unhappy islanders see is a series of cloud outages that consume cloud service resources little by little until there are none left for the islanders to use. As the dark clouds leave the island, the islanders groan and moan about not having been well prepared for the cloud service outages. Cloud service performance falls far below the levels of service availability guaranteed in the service level agreements (SLAs).
Cloud outage risks
Cloud outage risks are most likely to occur when a threat agent can take advantage of a cloud vulnerability. All organizations are vulnerable to the consequences of cloud outages caused by resource exhaustion. The types of failures that trigger outages include:
- Leap year failure
- Numerically unstable algorithms
- Resource optimization failure
- Threshold policy implementation failure
- Hypervisor failure
- Virtual desktop failure
Leap year failure
The security certificate issuing server in Microsoft's Azure cloud failed to make the leap: it didn't recognize the date Feb 29, 2012, which caused a cloud outage. Because the certificate server could not issue a proper certificate, virtual machines could not start. The server's host agent read that as a likely hardware problem and reported it to the cloud's cluster controller so that the virtual machine could be moved to other hardware. That resiliency measure then moved a virtual machine that couldn't be started to healthy hardware, where the host agent again sent the same false hardware failure report, spreading a malady that might otherwise have been limited and corrected before it could cause a cloud service outage.
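The article does not show the code that failed, so the following is only an illustrative sketch of one common leap-day pitfall, not a reconstruction of the Azure defect. It assumes a hypothetical certificate routine that sets the expiry date to one year after the issue date; the function names `naive_expiry` and `safe_expiry` are invented for this example.

```python
from datetime import date


def naive_expiry(issued: date) -> date:
    """Add one year by bumping the year field; raises ValueError on Feb 29."""
    return issued.replace(year=issued.year + 1)


def safe_expiry(issued: date) -> date:
    """Add one year, falling back to Feb 28 when the target year has no Feb 29."""
    try:
        return issued.replace(year=issued.year + 1)
    except ValueError:
        # Only Feb 29 can fail here; every other month/day exists in all years.
        return issued.replace(year=issued.year + 1, day=28)


leap_day = date(2012, 2, 29)

try:
    naive_expiry(leap_day)
    naive_failed = False
except ValueError:
    naive_failed = True

print("naive version failed:", naive_failed)
print("safe expiry:", safe_expiry(leap_day))
```

Whether a leap-day expiry should fall back to Feb 28 or roll forward to Mar 1 is a policy choice; the sketch picks Feb 28 only for illustration.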
Numerically unstable algorithm
A numerically unstable algorithm causes endless loops of resource consumption in an attempt to solve a numerical problem until there are no more resources to consume. As the resources shrink, the cloud performance keeps slowing down until it results in a cloud outage.
In a simplistic scenario, computing the square root of 2 (which is roughly 1.41421) is a well-posed problem. Many algorithms solve this problem by starting with an initial approximation x1, such as 1.4 or 1.42, and then computing improved guesses x2, x3, and so on. The way a numerical algorithm is set up influences how fast its result converges. One algorithm can converge in far fewer iterations than another.
If the result from one iteration method converges fast for all initial approximations, it is considered numerically stable. If a second iteration method converges slowly for one initial approximation and diverges greatly for another, it is considered numerically unstable and consumes additional resources.
Consider two iteration methods, the Babylonian method and Method X, shown in Table 1.
Table 1. Two iteration methods
|Babylonian|Babylonian|Method X|Method X|
|---|---|---|---|
|x1 = 1.4|x1 = 1.42|x1 = 1.4|x1 = 1.42|
|x2 = 1.4142857...|x2 = 1.41422535...|x2 = 1.4016|x2 = 1.42026896|
|x3 = 1.414213564...|x3 = 1.41421356242...|x3 = 1.4028614...|x3 = 1.42056...|
| | |x1000000 = 1.41421...|x28 = 7280.2284...|
The Babylonian method is given by x_{k+1} = x_k/2 + 1/x_k. Method X is given by x_{k+1} = (x_k^2 - 2)^2 + x_k. I have calculated a few iterations of each scheme in the table with initial guesses x1 = 1.4 and x1 = 1.42.
Observe that the Babylonian method converges fast regardless of the initial guess, whereas Method X converges extremely slowly with initial guess 1.4 and diverges with initial guess 1.42 (reaching 7280.2284... by x28). Hence, the Babylonian method is numerically stable, while Method X is numerically unstable.
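The two schemes can be tried out with a short script. This is a minimal sketch: the two step functions follow the formulas given in the article, while the `run` helper and its `tol`, `max_iter`, and `blowup` parameters are assumptions added here so that a slow or diverging iteration stops instead of consuming resources indefinitely.

```python
def babylonian_step(x: float) -> float:
    """One step of the Babylonian method: x_{k+1} = x_k/2 + 1/x_k."""
    return x / 2 + 1 / x


def method_x_step(x: float) -> float:
    """One step of Method X: x_{k+1} = (x_k^2 - 2)^2 + x_k."""
    return (x * x - 2) ** 2 + x


def run(step, x1, tol=1e-12, max_iter=1000, blowup=1e6):
    """Iterate from x1 and return (status, index of last iterate, value).

    tol, max_iter, and blowup are illustrative guard values, not from the article.
    """
    x = x1
    for k in range(2, max_iter + 2):  # k is the index of the iterate being computed
        nxt = step(x)
        if abs(nxt) > blowup:
            return "diverged", k, nxt
        if abs(nxt - x) < tol:
            return "converged", k, nxt
        x = nxt
    return "gave up", max_iter + 1, x


for name, step in (("Babylonian", babylonian_step), ("Method X", method_x_step)):
    for start in (1.4, 1.42):
        print(name, start, run(step, start))
```

The iteration cap and the divergence bound are the part that matters for outage prevention: they put an upper limit on the work an unstable method can consume.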
Resource optimization failure
The PaaS developer's "cloud resource optimization" application fails. The failure causes this platform and other PaaS platforms hosted by the same provider to shut down completely. The "cloud resource optimization" application neither identified failures nor implemented short timeouts.
The PaaS developer did not build simple services, each composed of a single host; instead, he built them from multiple dependent hosts. Single-host services would have allowed the developer to create replicated service instances that can survive host failures.
Suppose the developer has a billing application that consists of business logic components A, B, and C. He can compose service groups like this:
(A1, B1, C1), (A2, B2, C2) ... (An, Bn, Cn)
Here n is the number of service categories, and each group (Ai, Bi, Ci) holds the components for service category i.
- For service category 1:
  - A1 is the logic to find service code
  - B1 is the logic to insert service category into the bill
  - C1 is the logic to check the accuracy of zip codes
- For service category 2:
  - A2 is the logic to find service code
  - B2 is the logic to insert service category into the bill
  - C2 is the logic to check the accuracy of zip codes
- For service category n:
  - An is the logic to find service code
  - Bn is the logic to insert service category into the bill
  - Cn is the logic to check the accuracy of zip codes
If a single virtual machine running the PaaS fails, the failure results in the loss of the entire service group. This means that if component A1 fails in one service group, the other two components, B1 and C1, will fail as well. If there is more than one service group, the entire system will fail.
To fix the problem, the developer decomposes the components into independent pools so that he can create multiple redundant service copies at healthy data centers:
(A1, A2, ... An), (B1, B2, ... Bn), (C1, C2, ... Cn)
This means that if component A1 fails or slows down, the other components A2 ... An in the same independent pool will not fail. The second pool of components B1, B2, ... Bn and the third pool of components C1, C2, ... Cn will not fail either.
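The difference between the two layouts can be sketched in a few lines. This is an illustration under simplifying assumptions: a service group is treated as usable only when all of its components are healthy, and a pool is treated as usable when at least one replica is healthy. The `Component` class and helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    healthy: bool = True


def build_components(n: int) -> dict:
    """Create components A1..An, B1..Bn, C1..Cn keyed by component type."""
    return {kind: [Component(f"{kind}{i}") for i in range(1, n + 1)] for kind in "ABC"}


def group_is_usable(group) -> bool:
    """A dependent service group (Ai, Bi, Ci) needs every member healthy."""
    return all(c.healthy for c in group)


def pool_is_usable(pool) -> bool:
    """An independent pool (A1..An) needs only one healthy replica."""
    return any(c.healthy for c in pool)


components = build_components(3)
components["A"][0].healthy = False  # component A1 fails

# Grouped layout: (A1, B1, C1), (A2, B2, C2), (A3, B3, C3)
groups = list(zip(components["A"], components["B"], components["C"]))
groups_usable = [group_is_usable(g) for g in groups]

# Pooled layout: (A1, A2, A3), (B1, B2, B3), (C1, C2, C3)
pools_usable = {kind: pool_is_usable(pool) for kind, pool in components.items()}

print("groups usable:", groups_usable)
print("pools usable:", pools_usable)
```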
Threshold policy implementation failure
The cloud fails because threshold policies on cloud performance are either not in place or not implemented. The provider fails to implement the following policies for SaaS, PaaS, and IaaS:
- Resource threshold to ensure resource consumption is balanced dynamically for applications in the cloud below or at the threshold level. Failure to implement this policy produces domino effects on other threshold policies.
- User threshold to ensure users can concurrently access the application, up to the limit specified in a user license, below or at the threshold level.
- Data request threshold to ensure data requests to the application can be processed immediately below or at the threshold level.
- Response threshold to ensure the application responds to a user or a data request in a timely manner below or at the threshold level.
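A threshold policy check of the kind listed above might look like the following minimal sketch. The policy names, limits, and sample measurements are all hypothetical; the article does not specify any values. A policy counts as breached only when the measured value rises above its limit, matching the "below or at the threshold level" wording.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    name: str
    limit: float


def breaches(thresholds, measurements):
    """Return the names of policies whose measured value exceeds the limit."""
    return [t.name for t in thresholds if measurements[t.name] > t.limit]


# Hypothetical limits for the four policy types in the list above.
policies = [
    Threshold("resource_utilization_pct", 80.0),
    Threshold("concurrent_users", 500),
    Threshold("queued_data_requests", 100),
    Threshold("response_time_ms", 250.0),
]

# Hypothetical measurements taken at one point in time.
sample = {
    "resource_utilization_pct": 91.5,
    "concurrent_users": 500,
    "queued_data_requests": 40,
    "response_time_ms": 310.0,
}

print("breached policies:", breaches(policies, sample))
```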
The provider fails to negotiate with:
- The PaaS developer on implementing a replicated service instance threshold policy to ensure that service instances that can survive host failures are replicated below or at the threshold level.
- The IaaS infrastructure specialist on implementing a virtual machine policy to ensure the number of virtual machines running on the same host is below or at the threshold level.
- The IaaS infrastructure specialist on implementing a network latency threshold to ensure the latency of sending a packet is below or at the threshold level. The latency includes the queuing delay incurred while holding multiple packets from different sources.
Hypervisor failure
The provider fails to determine resource capacity of a physical server to host additional virtual machines when needed. The provider attempts to add virtual machines beyond the limits of the resources for this physical server. The provider fails to determine how virtual machines can stress a processor's cache memory harder than a physical server can, and how processors differ in their ability to switch between the demands of applications and hypervisors.
Resource capacity planning does not include information on the varying ability of different chipsets to support virtual workloads and hypervisors, which is needed to determine how well a server will perform as a virtual machine host. Also, the impact of resource-intensive applications on virtual servers may be missing from capacity planning. Too many virtual machines (six, for example) running these applications can contribute to a cloud outage. Default memory sizes may be insufficient for every core on a processor.
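A basic admission check can keep a provider from adding virtual machines beyond a host's limits. The sketch below is deliberately simplified: it tracks only core count, memory, and a maximum number of virtual machines per host, and ignores the chipset and cache effects mentioned above. The `Host` and `VM` classes and all figures are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class VM:
    name: str
    cores: int
    memory_gb: float


@dataclass
class Host:
    cores: int
    memory_gb: float
    max_vms: int
    vms: list = field(default_factory=list)


def can_place(host: Host, vm: VM) -> bool:
    """Check the VM count, core, and memory limits before adding a VM."""
    used_cores = sum(v.cores for v in host.vms)
    used_memory = sum(v.memory_gb for v in host.vms)
    return (
        len(host.vms) < host.max_vms
        and used_cores + vm.cores <= host.cores
        and used_memory + vm.memory_gb <= host.memory_gb
    )


def place(host: Host, vm: VM) -> bool:
    """Add the VM only if the host has capacity; report whether it was added."""
    if not can_place(host, vm):
        return False
    host.vms.append(vm)
    return True


host = Host(cores=8, memory_gb=32, max_vms=6)
results = [place(host, VM(f"vm{i}", cores=2, memory_gb=8)) for i in range(1, 6)]
print("placement results:", results)
```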
Virtual desktop failure
The provider fails to determine the resource capacity of a hypervisor to correctly deploy virtual desktops (which are obtained from desktop images taken from a physical server). The provider attempts to add virtual desktops to the same hypervisor beyond the limits of the virtual resources for this hypervisor. The provider fails to correctly set up a distributed virtual desktop environment that deploys images to specific groups based on their location or job function.
Resource capacity planning does not include information on varying desktop requirements of different user groups in order to determine how well virtual desktops will perform on a hypervisor. There may be negative impacts of the resource-intensive applications on the virtual desktops which in turn could impact the hypervisor. If the hypervisor fails, the virtual desktops will fail.
Expectations on risk mitigation
To measure how well the risks of cloud outages are mitigated, you should look at three expectation types: user, developer, and technical maintainer. Let's look underneath the user interface to discover what developers and technical maintainers are doing to meet those user expectations.
User expectations
Since everything is running smoothly, as it should be, users can expect to access the SaaS application up to the limit specified in a user license. A user uses remote desktop control to quickly access an application on a virtual desktop for the information he needs to make important business decisions. Users can expect that at all times, including the leap day:
- Download time is fast.
- Application response to user's requests is fast.
- Technical maintainer backs up the data in the background.
The user rates the technical maintainer as having a good business reputation: reliable, fast, secure, and efficient in mitigating risks of cloud outages. If a cloud outage occurs, the user expects the provider to mitigate the risks by quickly:
- Failing over the application to a healthy SaaS cloud running at another data center.
- Fixing the damage.
- Returning it to the home cloud.
Developer expectations
Developers can expect all business applications that ran well at the in-house data center will run well in the cloud. They can expect:
- Applications in the cloud will respond quickly to user's data requests.
- Resources will scale up and down smoothly.
- All numerical methods used in any applications are stable.
All developer expectations rely on user expectations of how the risks of cloud outages should be mitigated. Developers monitor and test performance using the metrics spelled out in threshold policies. Poor performance can result in unexpected service outages, leaving users stranded without the information they need to make important business decisions.
If a cloud outage occurs, the developer expects the technical maintainer to mitigate the risks by quickly failing over the PaaS to another healthy data center and fixing the damage. The developer expects the technical maintainer to allow him to:
- Develop a resource pooling application that can automatically replicate resource instances (see Resources).
- Fail them over to the healthy data center.
- Set up a mechanism to trigger the pooling application when cloud performance slides to a certain level as set forth in threshold policies and the SLA.
Technical maintainer expectations
Technical maintainers are usually the providers or third parties working for the providers. They ensure that technologies are properly used to migrate in-house applications to the cloud. Once the applications are in the cloud, they monitor and test performance using the metrics spelled out in threshold policies to ensure the risks of cloud outages are mitigated. Ensuring good performance is important in maintaining the technical maintainer's good business reputation as reliable, fast, secure, and efficient.
If cloud outages occur, a technical maintainer is expected to fail over the following to a healthy data center:
- SaaS applications
- Business applications developed on the PaaS
- Virtual machines on the IaaS
Once they are failed over, the technical maintainer is expected to fix the damage to:
- The hypervisor
- Virtual desktops
- Physical network infrastructure underlying the IaaS
The technical maintainer can set up a mechanism to trigger an alert when cloud performance slides to a certain level as set forth in threshold policies and the SLA.
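One way such a trigger could work is sketched below. It is an assumption-laden illustration: the `PerformanceAlert` class is hypothetical, response time stands in for "cloud performance", and the rule of alerting only after several consecutive breaches is a design choice made here to avoid reacting to a single slow sample.

```python
from collections import deque


class PerformanceAlert:
    """Raise an alert after `window` consecutive samples exceed `limit_ms`."""

    def __init__(self, limit_ms: float, window: int = 3):
        self.limit_ms = limit_ms
        self.recent = deque(maxlen=window)  # keeps only the latest `window` results

    def observe(self, response_ms: float) -> bool:
        """Record one sample and return True if the alert should fire now."""
        self.recent.append(response_ms > self.limit_ms)
        return len(self.recent) == self.recent.maxlen and all(self.recent)


# Hypothetical threshold and response-time samples, in milliseconds.
alert = PerformanceAlert(limit_ms=250, window=3)
fired = [alert.observe(sample) for sample in (100, 300, 320, 310, 120)]
print("alert fired per sample:", fired)
```

The same check could trigger a failover instead of an alert; only the action taken when `observe` returns True would change.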
Service level agreement's risk mitigation role
With cloud outages still frequent and unpredictable, a cloud SLA provides more protection to firms looking to safeguard their data. The SLA contributes to risk mitigation by providing guaranteed levels of service availability.
The SLA should focus on guarantees of:
- Network uptime: Ensures the specified number of IP packets are successfully sent and received over the network at low latency within a given time frame.
- Server uptime: Ensures each server and related equipment are operational and available for service at a guaranteed uptime level.
- Customer service: Ensures resolution of customer service requests within a given time frame.
- Threshold level: Ensures threshold levels are maintained at a given guaranteed service availability.
- Leap year uptime: Ensures the leap year date is recognized, so that the guaranteed service availability level is maintained.
These guarantees should be combined into a single composite focus on resource uptime availability.
Cloud outage credits should be given to consumers when:
- Uptime performance falls below the guaranteed levels of service availability.
- Customer service requests are not resolved in a timely manner.
- Threshold levels are not maintained.
The following are not counted as SLA outages:
- Denial of service attacks, regardless of the target.
- Outages caused by the customers (developers) themselves.
- Scheduled or emergency network, equipment and/or facility maintenance.
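The credit conditions and exclusions above can be expressed as a simple availability calculation. This is a sketch with hypothetical figures: the billing period, outage durations, cause labels, and guaranteed percentages are invented for illustration, and a real SLA would define each of them precisely.

```python
from dataclasses import dataclass

# Causes that the list above excludes from SLA outage accounting (labels are hypothetical).
EXCLUDED_CAUSES = {"denial_of_service", "customer_caused", "maintenance"}


@dataclass(frozen=True)
class Outage:
    minutes: float
    cause: str


def counted_downtime(outages) -> float:
    """Sum downtime minutes, skipping outages with an excluded cause."""
    return sum(o.minutes for o in outages if o.cause not in EXCLUDED_CAUSES)


def availability_pct(period_minutes: float, outages) -> float:
    """Availability over the period, counting only non-excluded downtime."""
    return 100.0 * (period_minutes - counted_downtime(outages)) / period_minutes


def credit_due(period_minutes: float, outages, guaranteed_pct: float) -> bool:
    """A credit is due when availability falls below the guaranteed level."""
    return availability_pct(period_minutes, outages) < guaranteed_pct


period = 30 * 24 * 60  # a 30-day period, in minutes
outages = [
    Outage(90, "hypervisor_failure"),
    Outage(240, "maintenance"),
    Outage(30, "denial_of_service"),
]

print("counted downtime (min):", counted_downtime(outages))
print("availability (%):", round(availability_pct(period, outages), 3))
print("credit due at 99.9%:", credit_due(period, outages, 99.9))
```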
Mitigating the risks of cloud computing outages requires plenty of pre-planning, because cloud outage risks are most likely to occur when a threat agent can take advantage of a cloud vulnerability. Users, developers, and technical maintainers must get together as a team to discuss their expectations on risk mitigation and determine what the role of the SLA in risk mitigation should be. As with everything else in life, the most important thing a user or developer should do is get a copy of the policy on mitigating risks of cloud outages.
Resources
- Get more information about building a service pooling application in "Build a cloud failover policy".
- See the articles the author refers to about SLAs for web services: "Use SLAs in a Web services context, Part 1: Guarantee your Web service with a SLA" and "Use SLAs in a Web services context, Part 7: Mitigate risk for vulnerability with a SLA guarantee".
- The Practical Guide for Service Level Agreements, published by the Cloud Standards Customer Council (CSCC), highlights the critical elements of a service level agreement for cloud computing and provides guidance on what to expect and what to be aware of when negotiating an SLA.
- Read more about cloud metrics in the Report on Cloud Computing to the OSG Steering Committee, written by the Cloud Computing Working Group in the SPEC Open Systems Group.