Standardize cloud SLA availability with numerical performance data

Balancing availability policy when the range in guaranteed service levels complicates SLA management


Like the Babylonian approximation of √2, exact answers are often impossible to obtain in practice; that's also true when you calculate service availability guarantees as part of your cloud provider service level agreement (SLA). Variables in the provider/partner relationship and accumulated rounding and truncation errors are among the many elements that can keep you from an exact answer.

But you can combine numerical analysis of performance data with constraints on errors to create approximate solutions that work well in the real world. This article shows you how to start.

Goals and expectations for numerical analysis

The goal of using numerical analysis of performance data for SLA standardization is to ensure that a cloud service is guaranteed to be available at least 99.9 percent of the time, by obtaining approximate solutions while maintaining reasonable bounds on errors. The analysis should focus on how a SaaS application, a PaaS environment, or IaaS virtual machines running in a cloud service can benefit from numerical analysis of the data used in service availability calculations.

In other words:

  • Choose an algorithm that does not let errors grow much larger in the calculation.
  • Build an algorithm by decomposing availability services into independent pools.

Let's take a look at what customers of each cloud service type can expect from a guaranteed service availability level.

SaaS users can expect high availability in terms of total hours of availability, the impact of outages in terms of hours and minutes, and the mean time to repair. They can also expect the failover of servers to a healthy data center to be accomplished in a few minutes.

PaaS developers can expect high availability after negotiating with a provider for the ability to build proactive actions that halt damage to the system. These actions include building simple services that each run on a single host rather than across multiple hosts, so that developers can create replicated service instances that survive host failures and contribute to high availability. PaaS developers use numerical analysis to obtain approximate guaranteed service availability levels while maintaining reasonable bounds on errors.

IaaS infrastructure specialists can expect high availability after negotiating with a provider to build proactive actions to halt damage to the virtual machine system while maintaining high availability.

Since partner relationships can generate complex variables, let's look at those next.

Complex relationships of partners

While SLAs have traditionally been a contract between a service provider and a cloud service customer (an enterprise, business, or government agency), the expanding value chain for other services has made SLAs important for a myriad of often complex relationships among partners.

For example, the same service provider can provide services to:

  • Cloud service customers (SaaS end users, PaaS developers, and IaaS infrastructure specialists)
  • Vendors
  • Large enterprises
  • Businesses
  • Government agencies

The same vendor provides services to:

  • Network providers
  • Cloud service providers
  • Web service providers
  • Enterprises
  • Businesses
  • Government agencies

The same network provider provides network access services and offers guaranteed service availability of 99.9 percent to:

  • Other network providers
  • SaaS providers
  • State government agencies

The same enterprise provides private cloud services guaranteeing service availability of 99.999 percent to:

  • SaaS cloud services for city government agencies
  • PaaS cloud services to develop new SaaS applications
  • Other enterprises to run private IaaS cloud services

To compete successfully, companies must proactively manage the quality of their services. Since provisioning of those services is dependent on multiple partners, management of SLAs becomes critical for success.

When terminologies are not standardized, managing the SLAs in the complex relationships of partners becomes more difficult. Let's discover some of those terminology inconsistencies.

What is availability, anyway?

If points of failure of the system are not analyzed, then the system availability calculated (the SLA) will be flawed from the beginning. Complicating matters is the fact that different people have different definitions of availability. For instance:

  • Does scheduled downtime for maintenance count against your system availability calculation?
  • What system components does the system availability calculation cover? Database? Application? Firewall?
  • How are errors handled in the system availability calculation? Can reasonable bounds on errors be maintained during the iterative process of calculating approximate solutions?

In its classic form, availability is represented as a fraction of total time that a service needs to be up. From a theoretical perspective, it can be quantified as the relationship of failure recovery time (also known as MTTR, mean time to recovery) to the interval between interruptions (MTBF or MTBI, mean time between failures or interruptions).
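The relationship between MTBF and MTTR is commonly written as availability = MTBF / (MTBF + MTTR). The following is a minimal sketch of that standard steady-state formula; the function name and the sample figures are hypothetical and chosen only for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of total time the service is up (steady-state formula)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures: one failure every 1,000 hours, one hour to recover.
sample = availability(1000.0, 1.0)
print(f"availability = {sample:.6f} ({sample * 100:.4f} percent)")
```

A shorter recovery time or a longer interval between failures both push the result closer to 1.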

For an entire year (365 days x 24 hours x 60 minutes = 525,600 minutes), availability can be represented as the "nines" (as shown in Table 1 later in this article).

But before we get to availability goals and what the nines should really reflect, let's examine numerical analysis error types.

Numerical analysis error types

Many numerical methods are iterative; they are not expected to terminate in a finite number of steps. They involve repeating the calculation many times. Starting from an initial guess, iterative methods form successive approximations that converge to the exact solution only in the limit. All numerical methods involve approximations due to either limits in the algorithm or physical limits in the computer hardware.

During the operation of the iterative process, two types of errors are usually introduced: truncation and rounding. These errors may accumulate to such an extent as to dominate the calculation and make the result meaningless. Or they may stay within an acceptable range so that the result remains significant.

Truncation error is caused by truncating a mathematical procedure. Often a solution is approximated by a Taylor series, which can be truncated after any number of terms. The more terms that are retained in the Taylor series, the better the approximation and the smaller the truncation error.
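To make truncation error concrete, here is a small sketch that sums the Taylor series for e^x around 0 and compares it with the library value. The function name exp_taylor and the choice of e^1 as the test case are assumptions made for this illustration only.

```python
import math


def exp_taylor(x: float, terms: int) -> float:
    """Sum the first `terms` terms of the Taylor series for e**x around 0."""
    total = 0.0
    term = 1.0  # the k = 0 term, x**0 / 0!
    for k in range(terms):
        total += term
        term *= x / (k + 1)  # next term: x**(k+1) / (k+1)!
    return total


# Truncation error shrinks as more terms are retained.
for terms in (2, 4, 8, 16):
    error = abs(math.exp(1.0) - exp_taylor(1.0, terms))
    print(f"{terms:2d} terms: truncation error = {error:.3e}")
```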

Rounding error is caused by representing numbers approximately. It is the difference between the calculated approximation of a number and its exact mathematical value.

Rounding errors originate from the fact that computers can represent numbers using only a fixed number of significant digits. Using double precision in algorithms reduces the effects of rounding error in numerical calculations. Double precision uses 64 bits, 52 of which store the significand (the significant figures). This allows π, for example, to be represented as 3.141592653589793, that is, 16 digits. The full range of normalized double-precision numbers is ±2.2250738585072014 x 10^-308 to ±1.7976931348623157 x 10^308.
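These limits can be inspected directly. The sketch below assumes a Python interpreter whose float type is an IEEE 754 double, which is the case on common platforms; the reported significand width is 53 bits because the format adds one implicit leading bit to the 52 stored bits.

```python
import math
import sys

# Width of the significand in bits, including the implicit leading bit.
print("significand bits:", sys.float_info.mant_dig)

# Smallest positive normalized value and largest finite value.
print("min normalized:  ", sys.float_info.min)
print("max finite:      ", sys.float_info.max)

# Pi as stored in double precision.
print("pi:              ", repr(math.pi))

# A classic rounding artifact: 0.1 and 0.2 have no exact binary representation.
print("0.1 + 0.2 == 0.3 ?", 0.1 + 0.2 == 0.3)
```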

Once an error is generated, it will generally propagate through the calculation. If the error stays within the acceptable range and does not grow to be much larger during the calculation, the algorithm is said to be numerically stable.
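As a small illustration of how a rounding error propagates, the sketch below adds 0.1 ten times with a plain running total and compares the result with math.fsum, which tracks the lost low-order bits. The variable names are hypothetical; the point is only that two mathematically identical sums can differ in the last digit.

```python
import math

values = [0.1] * 10

# Plain running total: each addition rounds, and the errors accumulate.
naive_total = 0.0
for value in values:
    naive_total += value

# math.fsum compensates for the intermediate rounding errors.
accurate_total = math.fsum(values)

print("naive running total:", repr(naive_total))
print("math.fsum total:    ", repr(accurate_total))
print("difference:         ", accurate_total - naive_total)
```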

Availability goals

The nines are a tempting target for setting availability goals. As shown in Table 1, the nines are availability percentages made up entirely of the numeral nine, representing the time the system is available. The issue with any availability goal, including those that are not all nines, is that the goal does not reveal which service component's availability it represents.

By the way, it's quite possible the nines were created originally partially due to a computer hardware issue on a long-gone mainframe computer. For my numerical analysis project in FORTRAN, I was aware that when the mainframe converted 1/10 to 0.10, the machine internally calculated it as .099999999... due to hardware limitations on registers. My program displayed the result as 0.10.

Let's take a look at the table and assume the website is running e-commerce 7 days a week, 24 hours a day.

Table 1. Table of fractional outages: The nines
Availability measure      Downtime per year    Downtime per week
90% (one nine)            36.5 days            16.8 hours
99% (two nines)           87.6 hours           101.08 minutes
99.5%                     43.8 hours           50.54 minutes
99.8%                     1,052 minutes        20.22 minutes
99.9% (three nines)       526 minutes          10.11 minutes
99.95%                    4.38 hours           5.05 minutes
99.99% (four nines)       53 minutes           1.01 minutes
99.999% (five nines)      5 minutes            ≤ 6.00 seconds

One handy way to think of nines in a 365 x 24 year is in orders of magnitude:

  • Five nines represents five minutes of downtime
  • Four nines represents about 50 minutes
  • Three nines, 500 minutes
  • etc.

Every tenth of a percentage point per year is roughly 500 minutes of downtime. Of course, for services that don't need to operate 24-hours-a-day-seven-days-a-week, such as factory-floor applications in a single location, the outage minute numbers will vary based on the local operational window.
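The yearly downtime figures follow directly from the 525,600 minutes in a 365-day year. The sketch below recomputes them; the function name is hypothetical, and the small floating-point residue in the subtraction is hidden by formatting the output to two decimal places.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes of allowed downtime in a 365 x 24 year."""
    return MINUTES_PER_YEAR * (100.0 - availability_percent) / 100.0


for level in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(level)
    print(f"{level}% available: {minutes:.2f} minutes of downtime per year")
```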

It should be readily apparent that getting past one minute of downtime per week can be quite an expensive proposition. Redundant systems that double the hardware required — in extreme cases, down to specialized fault-tolerant processes that compare instructions at every clock — and complex software that can handle the redundancy are just the beginning. The skills to deal with the complexity and the system's inability to handle change easily drive up the cost.

Moreover, experience shows that people and process issues in such environments cause far more downtime than the systems themselves can prevent. Some IT operations executives are fond of saying that the best way to improve availability is to lock the data center door. Be that as it may, any foray into high availability goal-setting should begin with a careful analysis of how much downtime users can really tolerate and what is the impact of any outage.

The nines are a tempting target for setting goals; the most common impulse for any casual consumer of these nines is to go for a lot of them. Before you succumb to the temptation, bear in mind one thing — you can't decide how much availability you need without first asking "availability of what?" and "what does downtime per week/year represent?".

For instance, consider the following:

Availability measure = 98%
Downtime per year = 7.3 days
Downtime per week = 202.15 minutes

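Calculate as follows:

525,600 minutes x 2 percent = 10,512 minutes downtime per year
10,512 minutes / 52 weeks = 202.15385 minutes downtime per week

The table does not carry this down to a per-day figure:

202.15385 minutes / 7 days = 28.879121 minutes downtime per day

That figure would round to 28.88 minutes and would carry a hidden rounding error, because a year is not exactly 52 seven-day weeks:

365 days / 52 weeks = 7.0192308 days per week
Round to 7 days per week
The difference is 0.0192308 days per week; over a year, 0.0192308 x 52 = 1.0000016 days (or 0.99999952 days if the fraction is truncated to 0.01923076 instead of rounded)

Either way, about one day of the year (365 - 52 x 7 = 1) is unaccounted for. Fold that day back into the weeks and you get a week with 8 days, which is impossible!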
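One way to see where the discrepancy comes from is to redo the 98 percent arithmetic with exact fractions. This is a sketch, not part of the original calculation; the variable names are hypothetical. It compares the 52-week convention with a true seven-day share of a 365-day year.

```python
from fractions import Fraction

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

downtime_per_year = MINUTES_PER_YEAR * Fraction(2, 100)  # 2 percent downtime

# Convention used for the weekly column: divide the year into 52 weeks.
per_week_52 = downtime_per_year / 52

# Exact alternatives: a true 7-day share of the year, and a 1-day share.
per_week_exact = downtime_per_year * 7 / 365
per_day_exact = downtime_per_year / 365

print("per year:             ", float(downtime_per_year))
print("per week (52 weeks):  ", float(per_week_52))
print("per week (7/365 year):", float(per_week_exact))
print("per day (1/365 year): ", float(per_day_exact))
print("per day (weekly / 7): ", float(per_week_52 / 7))
```

The two per-day results differ because 52 weeks cover only 364 of the 365 days.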

Is 99.9 a result of rounding off or truncation of numbers? This depends on what numerical method is being used in calculating the availability level.

Function equivalents: Watch for loss of significance

Cloud providers typically provide redundant services and servers to mask some types of system failures. Some providers, however, do not offer algorithms that give a more effective approach to obtaining approximate availability metrics while keeping the errors generated during the calculation within acceptable bounds. If the errors exceed the bounds, the resulting metrics will show loss of significance.

PaaS developers most likely have more expertise than the provider in developing an algorithm that solves a well-conditioned problem. Such an algorithm may be either numerically stable or numerically unstable; the art of numerical analysis is to find a stable algorithm for solving a well-posed mathematical problem.

For example, computing the square root of 2 (which is roughly 1.41421) is a well-posed problem. Many algorithms solve this problem by starting with an initial approximation $x_1$ to √2 (for instance, $x_1 = 1.4$) and then computing improved guesses $x_2$, $x_3$, and so on. One such method is the famous Babylonian method, which is given by $x_{k+1} = x_k/2 + 1/x_k$. Another iteration, which we will call Method X, is given by $x_{k+1} = (x_k^2 - 2)^2 + x_k$ [3]. I've calculated a few iterations of each scheme in Table 2 with initial guesses $x_1 = 1.4$ and $x_1 = 1.42$.

Table 2. Two iteration methods
Babylonian              Babylonian               Method X                 Method X
x1 = 1.4                x1 = 1.42                x1 = 1.4                 x1 = 1.42
x2 = 1.4142857...       x2 = 1.41422535...       x2 = 1.4016              x2 = 1.42026896
x3 = 1.414213564...     x3 = 1.41421356242...    x3 = 1.4028614...        x3 = 1.42056...
                                                 x1000000 = 1.41421...    x28 = 7280.2284...

Observe that the Babylonian method converges fast regardless of the initial guess, whereas Method X converges extremely slowly with initial guess 1.4 and diverges for initial guess 1.42. That means the Babylonian method is numerically stable while Method X is numerically unstable.
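The two iterations are easy to try out. The sketch below implements the Babylonian update x/2 + 1/x and the Method X update (x^2 - 2)^2 + x; the helper name iterate and the iteration counts are choices made for this illustration. Multiplication is used instead of the power operator so that a diverging value overflows to infinity rather than raising an exception.

```python
import math


def babylonian_step(x: float) -> float:
    """One Babylonian update toward the square root of 2."""
    return x / 2 + 1 / x


def method_x_step(x: float) -> float:
    """One Method X update: (x**2 - 2)**2 + x."""
    d = x * x - 2
    return d * d + x


def iterate(step, x1: float, steps: int) -> float:
    """Apply `step` to the initial guess x1 the given number of times."""
    x = x1
    for _ in range(steps):
        x = step(x)
    return x


for guess in (1.4, 1.42):
    print("Babylonian from", guess, "after 5 steps:", iterate(babylonian_step, guess, 5))

print("Method X from 1.4 after 1000 steps:", iterate(method_x_step, 1.4, 1000))
print("Method X from 1.42 after 40 steps: ", iterate(method_x_step, 1.42, 40))
```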

Numerical stability is affected by the number of significant digits the machine keeps. On a machine that keeps only four decimal places, the following two equivalent functions give a good illustration of loss of significance:

$$f(x) = x\left(\sqrt{x+1} - \sqrt{x}\right)$$

$$g(x) = \frac{x}{\sqrt{x+1} + \sqrt{x}}$$

If you compare the results of these two:

$$f(500) = 500\left(\sqrt{501} - \sqrt{500}\right) = 500(22.3830 - 22.3607) = 500(0.0223) = 11.15$$

$$g(500) = \frac{500}{\sqrt{501} + \sqrt{500}} = \frac{500}{22.3830 + 22.3607} = \frac{500}{44.7437} = 11.1748$$

You can see that loss of significance, which is also called subtractive cancellation, has a huge effect on the results, even though both functions are equivalent. To show that they are equivalent, start with f(x), multiply by the conjugate, and simplify until you end with g(x):

$$
\begin{aligned}
f(x) &= x\left(\sqrt{x+1} - \sqrt{x}\right) \\
     &= x\left(\sqrt{x+1} - \sqrt{x}\right)\frac{\sqrt{x+1} + \sqrt{x}}{\sqrt{x+1} + \sqrt{x}} \\
     &= x\,\frac{\left(\sqrt{x+1}\right)^2 - \left(\sqrt{x}\right)^2}{\sqrt{x+1} + \sqrt{x}} \\
     &= \frac{x}{\sqrt{x+1} + \sqrt{x}} = g(x)
\end{aligned}
$$

The true value for the result is 11.174755, which matches g(500) = 11.1748 after rounding the result to four decimal places.
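The four-decimal-place machine can be imitated by rounding after every operation. The sketch below is an approximation of that idea using Python's round; the helper r4 and the function names are hypothetical. It evaluates both forms at x = 500 and compares them with the full double-precision result.

```python
import math


def r4(value: float) -> float:
    """Imitate a machine that keeps four decimal places after each operation."""
    return round(value, 4)


def f_rounded(x: float) -> float:
    """x * (sqrt(x + 1) - sqrt(x)), rounding every intermediate result."""
    difference = r4(r4(math.sqrt(x + 1)) - r4(math.sqrt(x)))
    return r4(x * difference)


def g_rounded(x: float) -> float:
    """x / (sqrt(x + 1) + sqrt(x)), rounding every intermediate result."""
    total = r4(r4(math.sqrt(x + 1)) + r4(math.sqrt(x)))
    return r4(x / total)


reference = 500.0 / (math.sqrt(501.0) + math.sqrt(500.0))
print("f(500), four places:", f_rounded(500.0))
print("g(500), four places:", g_rounded(500.0))
print("double precision:   ", reference)
```

The subtraction in f discards most of the significant digits before the multiplication by 500 magnifies what is left, while the addition in g loses almost nothing.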

Now suppose that, for a different value of x, a pair of equivalent functions f(x) and g(x) used for availability metrics gives the following results:

f(x) = 99.99991846
g(x) = 99.99988723

The difference between the two results is 0.00003123.

The issue is which function to use: f(x) or g(x). In trying to keep rounding and truncation errors from growing much larger during the calculation, one partner may use f(x) while another may use g(x). A third partner may use f(x) and g(x) alternately in different steps of the calculation.

As a best practice, the partners should agree on which function equivalent to use, choosing the one that is more numerically stable.

In a Fortran project I worked on, I evaluated different functions by comparing:

  • What the results of each were to, say, six decimal places.
  • How large the errors grew in the calculation.

I chose the function that was the most numerically stable in calculating 100 decimal places so that rounding and truncation errors would not grow much larger during the calculation.

Reducing failover impacts on high availability

High availability calculation is a composite of availability levels for components of the system, such as database servers, application servers, and firewalls. Each component comes with redundant servers to reduce the impact of failover on high availability. Better yet, enterprising PaaS developers can build an algorithm to decompose services into independent pools.
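A composite figure is often estimated with two textbook rules: components that must all be up multiply their availabilities, and redundant copies of one component fail together only if every copy fails. The sketch below applies those rules under the assumption that failures are independent, which real systems may not satisfy; the component availabilities are hypothetical.

```python
import math


def redundant(availability: float, copies: int) -> float:
    """Availability of `copies` redundant instances, assuming independent failures."""
    return 1.0 - (1.0 - availability) ** copies


def in_series(availabilities) -> float:
    """Availability of a chain in which every component must be up."""
    return math.prod(availabilities)


# Hypothetical components: database, application server, firewall.
single = in_series([0.999, 0.999, 0.999])
paired = in_series([redundant(0.999, 2)] * 3)

print(f"no redundancy:     {single:.9f}")
print(f"two of each layer: {paired:.9f}")
```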

When failures happen, the developer's software should quickly identify those failures and begin the failover process.

There are three ways of getting reliable and numerically stable availability metrics:

  • Decompose availability metrics into availability services that can be measured.
  • Provide additional redundant servers and services without incurring prohibitive costs.
  • Build an algorithm to decompose services into independent pools rather than as multiple redundant services.

Let's focus on the third choice. Suppose the developer has a billing application that consists of business logic components A, B, and C. He can compose service groups like this:

(A1, B1, C1), (A2, B2, C2) ... (An, Bn, Cn)

Where n is the number of service group categories.

For service category 1:

  • A1 is the logic to find service code
  • B1 is the logic to insert service category into the bill
  • C1 is the logic to check the accuracy of zip codes

For service category 2:

  • A2 is the logic to find service code
  • B2 is the logic to insert service category into the bill
  • C2 is the logic to check the accuracy of zip codes

For service category n:

  • An is the logic to find service code
  • Bn is the logic to insert service category into the bill
  • Cn is the logic to check the accuracy of zip codes

If a single virtual machine running the PaaS fails, the failure results in the loss of the entire service group. This means that if component A1 fails in one service group, the other two components, B1 and C1, fail as well. If more than one service group is affected, the entire system fails.

To fix the problem, the developer decomposes the components into independent pools like this so that he can create multiple redundant service copies at healthy data centers:

(A1, A2,...An ), (B1, B2...Bn ), (C1, C2, ...Cn)

This means that if component A1 fails or slows down, the other components A2, ... An in the same independent pool do not fail. The second independent pool of components B1, B2, ... Bn and the third pool of components C1, C2, ... Cn do not fail either.
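The regrouping from service groups to independent pools amounts to a transpose. The sketch below shows it for three hypothetical service groups; the function names are illustrative, and the component labels stand in for real service instances.

```python
# Three hypothetical service groups, each bundling components A, B, and C.
service_groups = [("A1", "B1", "C1"), ("A2", "B2", "C2"), ("A3", "B3", "C3")]


def to_independent_pools(groups):
    """Regroup (A_i, B_i, C_i) tuples into one pool per component type."""
    return [list(pool) for pool in zip(*groups)]


def healthy_members(pool, failed):
    """Members of a pool that can still serve requests."""
    return [member for member in pool if member not in failed]


pools = to_independent_pools(service_groups)
print("pools:", pools)

# If A1 fails, the rest of the A pool and all of the B and C pools keep serving.
failed = {"A1"}
for pool in pools:
    print("still healthy:", healthy_members(pool, failed))
```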

The developer uses quick timeouts and retries of the slow services while the failover of all independent service pools of the billing application is in progress. The developer needs to determine when to halt timeouts and retries to avoid system lockup resulting from consumption of all resources waiting on slow or failed services.
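One way to bound timeouts and retries is to cap both the number of attempts and the total time spent. The sketch below is an assumption about how that could look, not a description of any provider's API: call_with_retries, ServiceUnavailable, and the operation's timeout parameter are all hypothetical names.

```python
import time


class ServiceUnavailable(Exception):
    """Raised when retries are exhausted or the time budget is spent."""


def call_with_retries(operation, max_attempts=3, per_call_timeout=0.2, total_budget=1.0):
    """Retry a slow operation, but stop once attempts or total time run out."""
    deadline = time.monotonic() + total_budget
    last_error = None
    for _ in range(max_attempts):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # halt retries so waiting callers do not consume all resources
        try:
            # The operation is expected to honor the timeout it is given.
            return operation(timeout=min(per_call_timeout, remaining))
        except TimeoutError as exc:
            last_error = exc
    raise ServiceUnavailable("giving up on slow or failed service") from last_error


# Hypothetical stand-in for a billing lookup that times out twice, then answers.
attempts_seen = []


def flaky_billing_lookup(timeout):
    attempts_seen.append(timeout)
    if len(attempts_seen) < 3:
        raise TimeoutError("service is slow")
    return "service code found"


print(call_with_retries(flaky_billing_lookup))
```

Capping the total budget as well as the attempt count keeps a slow pool from tying up callers while failover is in progress.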


In planning for SLA standardization, consider the best practice of developing and following a standard way of doing things that multiple partners can use. In addition to new terminologies, numerical analysis of performance data should be considered for obtaining more reliable availability metrics while maintaining reasonable bounds on the errors. When the partners agree on a standard way of using numerical analysis of performance data when negotiating SLAs, they contribute to the process of SLA standardization.
