Organizations building or using "enterprise" class computing environments have to accept the fact that the Internet boom has imposed higher availability levels than IT departments have dealt with in the past. Applications today are not only doing more work but also have more users, often spread out across the globe, and requiring 24/7 availability. A common dilemma for IT is how to survive planned and unplanned outages in the enterprise while achieving other service level requirements, such as throughput and response times, and making efficient use of server resources.
While many organizations have provided high availability systems on centralized (e.g. mainframe) platforms for some time, the number of critical applications being deployed on distributed (e.g. UNIX, Windows) platforms is now growing significantly. Although, in theory, distributed systems can approach the levels of availability achieved on mainframes, in practice this requires the careful design, configuration and operation of sophisticated technologies and redundant capacity, in addition to the base operating platform.
A number of factors must be considered to appropriately plan for availability. This article examines the relationship between hardware availability, software availability, and the various capacity planning considerations and user load factors. It also attempts to describe the processes and activities involved in designing computing environments with the required availability and capacity.
Finally, the mathematics required to determine the level of availability likely to be achieved by a specific environment topology is relatively straightforward, but it is not universally known or understood. A spreadsheet that can be used to simplify availability calculations is provided with this article, and its use in performing calculations is explained in later sections.
The other sections of this article are organized as follows:
- The basics of capacity, redundancy and availability introduces the basic concepts involved in designing a high availability solution.
- Availability requirements for individual components gives an overview of the components for which availability solutions must be considered, and the types of solutions required for each component.
- Calculating availability introduces the mathematics required to calculate the availability of a particular topology and the spreadsheet accompanying this article that can be used to assist with the calculations.
- Designing availability solutions in the end-to-end project lifecycle indicates the activities relevant to the design of availability solutions at different points in the overall project lifecycle.
Finally, the Conclusion and References sections summarize the content of the article, and provide some links to further material.
1. The basics of capacity, redundancy and availability
To thoroughly understand the rest of this article, it is necessary to understand exactly what it means to have capacity, redundancy and availability, and how these factors are related in the enterprise.
1.1 Server capacity
Server capacity is the raw horsepower needed to run an application. Different applications have different server capacity requirements: some applications are CPU intensive while others may be memory intensive. The required server capacity to run an application is the basis for sizing the server itself. Several applications running together typically require larger servers with increased CPU and/or memory requirements. Purchase too much capacity, and some of the investment in the hardware goes idle. Purchase too little capacity, and the applications compete with each other for resources and negatively affect overall performance.
Ideally, a server running at 90%+ utilization in both CPU and memory would be the best utilization of invested resources; however, this leaves little room for spikes in demand, and we don't live in a world with a statistically uniform distribution of demand. Additionally, it can be difficult to predict incoming load characteristics while at the same time adjusting for seasonal differences, so allowing additional capacity by planning to run servers at well less than 90% is the norm. Planning out availability and server capacity in this fashion will continue to be a necessary task until enterprise environments start implementing "on demand" technologies that automatically adjust resource availability to handle changing workload conditions.
Figure 1. Server Capacity is a function of CPU and memory utilization as user load increases, and how this affects response time of the application
In Figure 1, which shows application characteristics at different user load levels, we see the interaction between server capacity and performance. The number of users starts at some base value and increases along the x-axis, where the load levels are doubled, tripled, and so on up to 6x the original number of users. The server capacity requirements of this application are defined by where memory and CPU limits are reached. This happens when either memory or CPU starts to spike which, in this example, is when 4x the number of users are exercising the application, causing the response time to increase. In this example, when the expected user load is 4.5x or higher, more than one physical server is required to provide adequate capacity to process the incoming requests with a reasonable response time.
1.2 Redundant server capacity
The provision of redundant server capacity is a key aspect of planning availability in the enterprise. A redundant server is either an active participant in the workload cell, or remains idle as a "hot standby" in case the primary server fails. Redundant servers have little return on investment, except in failure scenarios where they prove their worth by picking up the workload that would otherwise have not been processed by a downed server. Providing redundancy is a consideration for those organizations that either provide timely information, or manipulate "other people's money" or personal information. Redundant servers must be sized according to the required application server capacity in the production environment. Having redundant servers smaller than their production counterparts only results in negative performance related to decreased server capacity. This is not ideal, although in some cases it may be acceptable to have at least slow throughput rather than no throughput at all.
Figure 2. A cluster of servers, where two of them, while active participants in the workload, are redundant
Figure 2 shows a cluster of servers where the two redundant servers are active participants in processing the incoming workload. In this scenario, the environment has been scaled such that any two of the servers in the cell can be taken out of service without affecting the incoming workload. This is because the servers in the cell are sized so that any four of them can handle the entire incoming workload; the additional capacity provided by the two redundant servers only comes into play when other servers in the cell are taken out of service. Such outages can be planned or unplanned events, such as restarting servers in the cell after an application update, or the failure of a server power supply.
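The sizing logic behind Figure 2 can be sketched in a few lines. This is only an illustration: the function name `servers_required` and the load and capacity numbers are hypothetical, chosen so that four servers carry the workload and two more are redundant.

```python
import math


def servers_required(peak_load, per_server_capacity, redundant_servers):
    """Return (servers needed for the workload, total servers including redundancy).

    peak_load and per_server_capacity must use the same unit,
    for example requests per second (hypothetical unit).
    """
    base = math.ceil(peak_load / per_server_capacity)
    return base, base + redundant_servers


# Hypothetical figures: 3600 requests/s at peak, 1000 requests/s per server,
# and a requirement that any two servers may be out of service.
base, total = servers_required(peak_load=3600, per_server_capacity=1000,
                               redundant_servers=2)
print(f"{base} servers carry the workload, {total} are deployed")
```

The per-server capacity would normally come from performance testing, as in Figure 1, rather than from an estimate.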
Server redundancy is not that common in distributed UNIX or Windows platforms. This is due in some part to the fact that true high availability in the distributed environment was not necessary until applications in the enterprise evolved from small, in-house workloads to considerably higher workloads, often opened up to the world on the Internet. Historically, redundancy has been used to eliminate single points of failure in the hardware of individual servers, thereby increasing their availability. For example, RAID and hot swap CPU cards have been common hardware redundancy features at the single server level.
In order to support true high availability, redundancy has to be applied to each of the separate tiers in the enterprise architecture. In the classic mistake, an enterprise chooses to cluster its Web servers and application servers but not the database servers. The single point of failure then turns out to be the database server which takes all the applications off line when it goes down. High Availability (HA) solutions exist for every tier in the enterprise and should be explored when planning for availability. Techniques for high availability of most of the common components in a WebSphere Application Server architecture can be found in the Resources section.
1.2.1 Location, location, location
Redundant servers are not necessarily located in the same geography. In Figure 2, the servers can be split up into pairs, where each pair of servers resides in a physically separate data center. Such deployment of redundant capacity has additional network overhead and costs. However, a catastrophic disaster at any one of the data centers would not affect the operations of the other physically separate data centers. The actual administration of separate locations can be further decomposed into smaller sub-cells that are manually brought online only when necessary. However, note that the full scope of disaster recovery solutions is outside the scope of this document, and involves additional redundancies in electrical power, internet connectivity, and other factors.
1.2.2 Process isolation in redundant environments
Process isolation is another benefit provided by redundant servers within a workload cell. Applications on one server that is stopped or is in the process of failing do not negatively impact applications running on other servers. While ideally one works out application problems in testing rather than deploying frail applications into the production environment, in practice failures still occur, or it may be necessary to stop applications or servers in order to upgrade them or perform backups, so process isolation is nevertheless a useful capability for an enterprise infrastructure. Process isolation can also be provided in multiple ways in the same infrastructure; for instance, applications can be clustered within application servers that are, in turn, clustered horizontally across the redundant servers in the enterprise.
Figure 3. Process Isolation provides reliability for buggy applications.
In Figure 3, process isolation occurs both within the context of a single server, and horizontally across the servers. Processes A1, A2 and A3, on Server 1 are each separate instances of the same application A. Through cloning of the application on the same server, the application, should it go down for any reason, continues to process incoming requests through the remaining clones. Such process isolation is a convenient technique for running buggy applications, though in some cases running multiple copies of an "ill-mannered" application can result in an outage of the entire server, not just an application server process.
Additional isolation is achieved by running more instances of application A on the physically separate Server 2 (A4, A5, A6) in a horizontally configured cluster. The isolation in this case not only includes the process isolation of three additional, separate instances of application A, but also the explicit hardware isolation of the application processes on the horizontally separate server. This horizontal isolation of the application provides the capability of the application to continue processing requests regardless of the running status of any one particular server. The failure of a Network Interface Card (NIC) on one server would not affect the applications on the other.
Finally, note that the same concepts can obviously be extended to provide isolation between different applications, e.g. to shield the availability of a reliable application from the failures of a buggy application.
In designing a server topology to support high availability, the planner must take into account the types of servers required, the application workload capacity they must support, and the level of availability, or "up time", required. Capacity requirements come from performance testing the application and understanding what the incoming load characteristics are in production. The number of redundant servers is then determined by the Service Level Agreement (SLA) requirements. This is when the relationship between capacity, redundancy, process isolation and availability starts to mesh together.
Figure 4. Process isolation and redundancy providing availability
Availability is provided in this example through the combination of having:
- Determined the appropriate server capacity for the application.
- Determined the effective cost of planned and unplanned outages, and hence the justifiable expense of additional redundant capacity to prevent them.
- Appropriately sized the servers for additional process isolation of redundant application processes.
- Provided the correct number of servers for handling the expected incoming workload.
- Added two additional servers as redundant participants in handling the incoming workload.
In Figure 4, the application is cloned vertically on each application server providing the necessary process isolation in the event one of the application instances spontaneously terminates. The process isolation is preserved horizontally across the application servers should any of the servers themselves have to be taken out of service. The redundant servers, #5 and #6, provide the ability to continue processing incoming requests, should two of the servers need planned or unplanned maintenance.
While redundant servers are an additional cost, they are what provides high availability when outages occur. The number of redundant servers must therefore be planned so as to stay within a reasonable budget while still giving some assurance of availability. The rest of this article discusses the accompanying spreadsheet and the necessary mathematics of availability.
2. Availability requirements for individual components
As mentioned above, it is important in a high availability infrastructure to ensure that all the components are made highly available, not just the WebSphere Application Servers and Web Servers. For each component, the availability of the following needs to be considered:
- Processing capacity: If a component fails, what will pick up the processing workload?
- Configuration / static data: What mechanism exists to ensure that any relevant configuration or static data (e.g. Web content) information of a failed component is correctly passed to the backup?
- Dynamic data: What happens to any dynamic data managed by a failed component; e.g. is the data lost, frozen until the component is recovered, or passed by way of shared or replicated storage to the backup?
High availability of common components in a WebSphere architecture is covered in Resources; however, the following table lists some of the components and their characteristics that should be considered for failover:
Table 1. Characteristics to consider for failover
|Component||Processes to failover||Static or configuration data to failover||Dynamic data to failover|
|WebSphere Application Servers||Application servers, Administration Server (V4), Node Agent (V5) or Deployment Manager||Application deployment, connectivity and tuning configuration||Transaction logs if 2-phase commit is in use, HTTPSession|
|WebSphere Message Queue Servers||WMQ Processes||Queue configuration||Queues|
|HTTP Servers||HTTP Daemon||Web Content, configuration||--|
|Network elements, e.g. HTTP Switches, Routers, Reverse Proxies, Firewalls||Process||Configuration||--|
|Directories||Directory process||Configuration||Directory data|
|Backend Applications (CICS, ERP etc.)||Application process||Configuration||Assume data is in related database|
|Authentication Server||Authentication Process||Configuration||--|
|Systems Management||Process||Configuration||Event data|
The availability of an application extends beyond the mere availability of the application itself. There are some scenarios where high availability of the application is markedly affected by the backend system it is attempting to access. Scenarios where applications are interfacing to legacy applications that are not available in HA configurations or to 3rd party providers, that may or may not be available at various times, can negatively affect the availability of the application. These "soft" requirements simply add more fudge to the fudge factor.
Note that in addition to the relatively obvious infrastructure components discussed above, factors such as the availability of Internet connectivity, power supplies etc. must also be considered.
3. Calculating availability
Figure 5. Overall environment topology example
Figure 5 shows a schematic overall topology of an "environment", from ISP to backend. This section discusses how to analyze or calculate the overall availability of a system deployed in such an environment. The table below describes the individual components shown, but note that overall:
- The overall topology is only "available" if all the individual components are available.
- Most of the "components" in the schematic are likely to be implemented with some degree of redundancy.
Table 2. Common topology components
|ISP||Internet Service Provider, i.e. the connection between the environment and the wider network.|
|Router||Network routing between the ISP and the network in which the environment is hosted.|
|Switch||Technology that distributes requests across a cluster of Servers -- both HTTP and Application servers have switches in this example, although in some cases the switch might be integrated in another component (e.g. in the case of WebSphere Application Server, the "switch" is implemented in the HTTP Server plug-in, on the other side of the firewall than shown in the generic diagram above).|
|Backend||Existing applications (e.g. CICS, databases, ERP etc.) to which the application servers connect. In many cases more than one backend system is involved, in which case these would be modeled as separate components in the overall architecture.|
|Auth server||Authentication Server.|
|LDAP||Used to indicate a directory component, often an LDAP directory.|
|Tx||Used to indicate any separate component responsible for managing transactions. For example, in the case of WebSphere Application Server v5, this would refer to the WebSphere Network Deployment Node Agents and the directories in which they store transaction logs -- these need to be considered separately from the base application servers.|
|Network||The network within which the environment is deployed. It would be possible to treat each element of the network (e.g. routers, bridges etc.) separately, but it is often appropriate to treat the network as a monolithic component with a single availability characteristic.|
|Power supply||The power supply on which the entire infrastructure depends -- i.e. this does not include the separate power supply for each server, just the source on which they all depend.|
|Physical building||The building in which the environment is housed.|
In an environment such as this, the basic approach to calculating availability is:
- Determine the availability of each component, as a number from 0 to 1.
- Multiply the availabilities of all components together to get the overall availability, usually expressed as a percentage.
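Mathematically, this is stated as:
$$A_{\text{overall}} = \prod_{i=1}^{n} A_i = A_1 \times A_2 \times \cdots \times A_n$$
where $A_i$ is the availability of component $i$, expressed as a number from 0 to 1, and $n$ is the number of components.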
At this point it is worth addressing the identification of the "physical building" as a single point of failure: this is true from the perspective of a single environment; however, many real applications are deployed across two or more physical locations to mitigate this risk. The techniques in this article can be expanded to a degree to cover that case, as described in the Calculation to multiple physical sites section below.
The next question to address is how to determine the availability of individual components. There are several ways to do this, including:
- Most of the components are actually clusters or active/passive pairs. See Calculating the availability of clusters below for information on calculating cluster availability.
- Some components (e.g. "network") consist of a great many individual components, the structure of which might be under different control than the remainder of the topology. If your purpose is to calculate the availability of a specific network topology, the calculations in this section will help. If, however, your purpose is to calculate the availability of a server topology, it might be better to use overall network availability figures from experience -- i.e. the known mean time between failures and mean time to recovery, or if these are not available, to make estimates. See Calculating the Availability of Individual Components below for details.
- Some components (e.g. "application server"), consist of a stack of individual components, but one that is simple enough to analyze the constituent layers. See the Calculating the availability of individual components section.
- Some components (e.g. "ISP") might be entirely out of control of the owners of the environment and its topology, and might be subject to contractual availability figures, in which case these can be used.
The rest of this article will describe calculations to determine the availability of single nodes, clusters, etc., after which you'll have a figure for the availability of each component in your overall topology. You will then be ready to calculate its overall availability, as above. Spreadsheet sheet 1, "Overall Chain", will help you to do that.
First, however, it's worth expanding on something alluded to above: the relationship between "availability" and "probability". Essentially, they are the same thing, except for the fact that availability is often expressed as a percentage whereas probability is expressed as a number from 0 to 1. So, from this point of view probability = availability% / 100.
The next step is to relate "availability" to the frequency at which failures are experienced, and the duration of those failures. The next section explains that calculation. However, it is worth noting that converting the information concerning how often failures occur and how long they last into an "availability" figure averages out the failure information. So, if you start with servers that exhibit disk failure once every five years for two weeks, but power supply failure once every three weeks for a day, that all gets rounded up into one figure -- so you lose any information specific to the more serious two week failures. Calculations of that sort are beyond the scope of the current discussion.
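To see how the averaging hides the difference between failure types, the two failure modes mentioned above can be turned into availability figures. This is a sketch under stated assumptions: it reads "once every five years" and "once every three weeks" as mean times to failure, uses availability = MTTF / (MTTF + MTTR), and treats the two failure modes as independent.

```python
def availability(mttf_days: float, mttr_days: float) -> float:
    """Fraction of time up, assuming availability = MTTF / (MTTF + MTTR)."""
    return mttf_days / (mttf_days + mttr_days)


# Figures from the text, read here as mean time to failure and mean time to recovery.
disk = availability(mttf_days=5 * 365, mttr_days=14)   # rare, two-week outages
power = availability(mttf_days=21, mttr_days=1)        # frequent, one-day outages

# Both must be working for the server to be up; independence is assumed.
combined = disk * power

for label, value in [("disk", disk), ("power supply", power), ("combined", combined)]:
    print(f"{label:>12}: {value:.4%} available, "
          f"{(1 - value) * 365:.1f} days down per year")
```

The combined figure says nothing about whether the downtime arrives as many one-day outages or as one two-week outage, which is exactly the information that is lost.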
3.1 Spreadsheet calculations for the availability of the overall solution from individual component availabilities
Spreadsheet sheet 1, "Overall Chain", calculates the availability of the overall solution from the availabilities of individual components, and is shown in Figure 6, below. The sheet provides a table of 10 rows, each containing space for 10 components of the overall solution. For each component, you can enter a name or description and an individual availability figure as a percentage. The "Overall Availability" cell then provides the overall solution availability.
Simply enter the names and availabilities of your components in as many cells in the table as you need. There are likely to be cells left over, so leave those blank. The order in which you enter information for the components in the cells does not matter.
Figure 6. Overall solution availability calculation on sheet 1: "Overall Chain"
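For readers without the spreadsheet to hand, the same chain calculation can be sketched in Python. The function name `overall_availability` and the component figures are hypothetical and only illustrate the arithmetic; entries set to `None` stand in for cells left blank.

```python
import math
from typing import Mapping, Optional


def overall_availability(components: Mapping[str, Optional[float]]) -> float:
    """Multiply component availabilities (given as percentages) into one percentage.

    Components whose value is None are treated as blank cells and skipped.
    """
    probabilities = [pct / 100.0 for pct in components.values() if pct is not None]
    return math.prod(probabilities) * 100.0


# Hypothetical availability figures, not taken from the article.
example_chain = {
    "ISP": 99.9,
    "Router": 99.99,
    "HTTP server cluster": 99.999,
    "Application server cluster": 99.999,
    "Backend": 99.95,
    "Unused cell": None,
}
print(f"Overall availability: {overall_availability(example_chain):.4f}%")
```

Because multiplication is commutative, the order of the entries makes no difference to the result, and the overall figure can never exceed that of the weakest component.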
3.2 Calculating the availability of individual components
"Availability" tells you the percentage of the time that a component is "up" rather than "down". At a simple level, it is comprised of two things: first, how often does the component fail? And second, when it does fail, how long does it take to recover it?
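Taking an example, a reasonable assumption is that UNIX hardware tends to fail (for example, the power supply breaks) around once every few years, let's say three. Many UNIX operators have a support contract with their vendor that guarantees a replacement or repair within 24 hours. Therefore, the availability of a typical UNIX server is:
$$A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}} = \frac{3 \times 365 \text{ days}}{3 \times 365 \text{ days} + 1 \text{ day}} = \frac{1095}{1096} \approx 0.99909$$
where MTTF is the mean time to failure and MTTR is the mean time to recovery. This is roughly 99.9%.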
That is the basic equation we will use to calculate availability of individual components -- you can find it on spreadsheet sheet 5, "Capacity & Availability".
It is worth pointing out that the calculation above uses the specific phrase "mean time to failure" rather than "mean time between failures". There is a subtle but important difference between the accepted meaning of these phrases:
Mean time to failure is the average continuously operational period -- i.e. the average period between the end of one failure and the beginning of the next.
Mean time between failures is the average elapsed time between the beginning of one failure and the beginning of the next.
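It is important to understand the precise definition of those phrases, and recognize that these calculations use mean time to failure. Mathematically, the two are related as follows:
$$\mathrm{MTBF} = \mathrm{MTTF} + \mathrm{MTTR}$$
where MTBF is the mean time between failures, MTTF is the mean time to failure, and MTTR is the mean time to recovery.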
3.2.1 Spreadsheet calculations of availability from mean time to failure and mean time to recovery
Spreadsheet sheet 5, "Capacity & Availability" contains several calculations, the first of which is labeled "Mean time to failure calculations", and is shown in Figure 7, below. This provides an availability figure based on mean time to failure information in days, and mean time to recover information in hours.
Figure 7. Availability calculation using mean time to failure and mean time to recover on sheet 5: "Capacity & Availability"
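The calculation on this sheet can be sketched as follows, assuming the usual definition availability = MTTF / (MTTF + MTTR) and the same units as the sheet (days for time to failure, hours for time to recover). The function name `availability_from_mttf` is an illustrative choice, not something taken from the spreadsheet.

```python
def availability_from_mttf(mttf_days: float, mttr_hours: float) -> float:
    """Availability as a probability from 0 to 1.

    mttf_days  -- mean time to failure, in days
    mttr_hours -- mean time to recover, in hours
    """
    mttf_hours = mttf_days * 24.0  # convert to a common unit before dividing
    return mttf_hours / (mttf_hours + mttr_hours)


# The UNIX server example: a failure about every three years, repaired within 24 hours.
a = availability_from_mttf(mttf_days=3 * 365, mttr_hours=24)
print(f"{a:.6f}")  # 0.999088
```

Converting both inputs to the same unit before dividing matters: mixing days, hours and seconds is an easy way to produce an availability figure that looks plausible but is wrong by orders of magnitude.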
3.2.2 Calculating the availability of components that have a "stack"
In a topology supporting an application, the availability of each node is a combination of several things:
- Availability of the hardware.
- Availability of the operating system.
- Availability of the application server.
- Availability of the application.
The process to figure out the actual availability of the node is similar to that used for the overall topology: all these things must be available for the node as a whole to be available, so you need to find the availability of each element of the "stack", then multiply them all together to get the availability of the node. This approach is identical to the approach to the overall solution, described above, and can also apply to some components that consist of more than one node, as described in Server clusters with shared high availability disks, below. Spreadsheet sheet 4, "Stacks", calculates the availability of a stack where every element must be available in order for the entire stack to be available.
Sometimes it may be either difficult or inappropriate to calculate the individual availability of some elements of a stack, for example:
- If you know the mean time to failure and the mean time to recovery, e.g. for the hardware, use the calculations in this section.
- If you do not know that information for some elements of the stack (e.g. because the "application code" has not been tested yet, or so far has tested fine), you need another approach. Assuming 100% availability for these elements will give you a "best case" scenario. Alternatively, you can test more extensively, guess, or wait until you go into production and can get real data on failures.
- Some of these components may not be your responsibility. For example, if you are asked to design an infrastructure to support 99.9x% availability, but the application code is the responsibility of someone else, then you can consider assuming 100% availability, and build this caveat into the specification of the topology. Otherwise, you might ask the application owner to provide some measure of the expected frequency and severity of failures.
3.2.2.1 Spreadsheet calculations for availability of stacks
Spreadsheet sheet 4, "Stacks", calculates the availability of a stack, and is shown in Figure 8, below. The sheet provides 10 tables, each intended to be used to calculate a separate stack. Each table has a label for the stack, e.g. "Application Server Node", and ten cells to describe the characteristics of individual elements of the stack, e.g. "hardware", "operating system", "application server middleware" etc.
For each element, you can describe the availability in two ways. If you have mean time to failure and mean time to recovery information, enter that and the spreadsheet will calculate the availability for you in the "Calculated availability contributions" lines. If you do not have that information, you can enter your own availability figure in the "Override availability contributions here" lines. If you specify both, then the overridden availability contributions are used.
The calculated availability of each stack is shown in the "Stack 1 availability", "Stack 2 availability" etc. cells at the bottom of each table.
Figure 8. Stack calculation on sheet 4: "Stacks"
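The behavior described for this sheet, including the rule that an override wins over the calculated figure, might be modeled as below. The class and function names are hypothetical, and the example stack values are invented for illustration; the 100% overrides reflect the "best case" assumption discussed earlier.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class StackElement:
    name: str
    mttf_days: Optional[float] = None     # mean time to failure, in days
    mttr_hours: Optional[float] = None    # mean time to recovery, in hours
    override_pct: Optional[float] = None  # manually entered availability, in percent

    def availability(self) -> float:
        """Return availability from 0 to 1; the override takes precedence."""
        if self.override_pct is not None:
            return self.override_pct / 100.0
        if self.mttf_days is None or self.mttr_hours is None:
            raise ValueError(f"{self.name}: supply MTTF and MTTR, or an override")
        mttf_hours = self.mttf_days * 24.0
        return mttf_hours / (mttf_hours + self.mttr_hours)


def stack_availability(elements: Iterable[StackElement]) -> float:
    """Every element must be up, so the availabilities are multiplied."""
    return math.prod(e.availability() for e in elements)


# Hypothetical "Application Server Node" stack.
node = [
    StackElement("hardware", mttf_days=3 * 365, mttr_hours=24),
    StackElement("operating system", mttf_days=365, mttr_hours=1),
    StackElement("application server middleware", override_pct=100.0),
    StackElement("application", override_pct=100.0),
]
print(f"Stack availability: {stack_availability(node):.6%}")
```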
At this point a note is due regarding the "cloning" of, for example, application server processes on a single node, a practice common with WebSphere applications. In this case, you have a mini "cluster" of processes on a single node. So, in order to calculate the combined availability of the "application server and application" element on an "Application Server Node", you could take the following approach:
- Assume the application will have an occasional bug in it (e.g. a failure to release database connections) that requires the application server to be restarted once a month.
- Restarting the application server takes 5 minutes.
- Have four clones of the application server running on a single node, or physical server.
- Taken together, these figures give a single application server an availability of approximately 0.99988 (5 minutes of downtime in a 30-day month), using the calculations in this section.
- Assuming the node will keep running, providing two of the application servers are available (a reasonable assumption, given that any outage is only for a few minutes), use these figures to perform a cluster calculation, as in Calculating the availability of clusters below.
- That cluster availability figure can be used as the availability contribution of the "application server and application" element in the "Application Server Node" stack calculation.
Finally, note that calculations such as the above tend to give very high availability figures -- in the example given, the "application server + application" availability is 100.000000% (i.e. 100% to 6 decimal places). That's an indication that this contribution to overall availability is minimal, unless you have a particularly buggy application.
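The clone example can be checked with a short sketch. It assumes a 30-day month, availability = MTTF / (MTTF + MTTR) for each clone, and independent failures between clones; the function name `k_of_n_availability` is illustrative only.

```python
from math import comb


def k_of_n_availability(n: int, min_up: int, a: float) -> float:
    """Probability that at least min_up of n independent processes are up,
    where each process is up with probability a."""
    return sum(comb(n, up) * a**up * (1 - a) ** (n - up)
               for up in range(min_up, n + 1))


# One restart a month (assumed to be 30 days), each taking 5 minutes.
mttf_minutes = 30 * 24 * 60
single = mttf_minutes / (mttf_minutes + 5)

# Four clones on the node, which is considered up while at least two are running.
cluster = k_of_n_availability(n=4, min_up=2, a=single)

print(f"single clone: {single:.6f}")         # 0.999884
print(f"four clones:  {cluster * 100:.6f}%")  # 100.000000%
```

Even though a single clone is down for a few minutes each month, three clones being down at the same moment is so unlikely that the combined figure rounds to 100% at six decimal places.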
3.3 Calculating the Availability of Clusters
Calculating the availability of clusters depends on knowing the basic availability of each node in the cluster. Once this is known, the formulae required to calculate the availability of the cluster are relatively well known, but can be rather tedious to calculate.
For example, consider a cluster of two machines, where the cluster as a whole is considered "available" if at least one of the two machines is available. Assuming the availability of one node is 99.90% (so the probability it is available is 0.9990, and the probability it is unavailable is 1.0000 - 0.9990 = 0.0010), the overall availability can be calculated using the following table:
Table 3. Cluster availability
|Server 1 available?||Server 2 available?||Cluster "available" in this configuration?||Probability of this configuration|
|Yes||Yes||Yes||0.9990 x 0.9990|
|No||Yes||Yes||0.0010 x 0.9990|
|Yes||No||Yes||0.9990 x 0.0010|
|No||No||No||0.0010 x 0.0010|
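The mathematics of probability then tells us that the probability of the cluster as a whole being available is the sum of the probabilities of all those individual configurations that count as "available". In this case, that works out as 0.999999, or 99.9999%. (Using the unrounded single-node availability of 0.999088 from the UNIX server example instead of 0.9990 gives 0.99999917, or 99.999917%.)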
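The table-based approach can be written out directly by enumerating every up/down configuration. The sketch below assumes that nodes fail independently and share the same availability; the function name is hypothetical.

```python
from itertools import product


def cluster_availability_by_enumeration(node_availability: float,
                                        n_nodes: int,
                                        min_up: int) -> float:
    """Sum the probabilities of every configuration with at least min_up nodes up."""
    total = 0.0
    # Each configuration is a tuple such as (True, False): node 1 up, node 2 down.
    for config in product([True, False], repeat=n_nodes):
        if sum(config) >= min_up:
            probability = 1.0
            for node_is_up in config:
                probability *= node_availability if node_is_up else 1 - node_availability
            total += probability
    return total


# Two nodes at 0.9990 each; the cluster is available if at least one node is up.
print(f"{cluster_availability_by_enumeration(0.9990, n_nodes=2, min_up=1):.6f}")  # 0.999999
```

The loop visits 2 to the power n configurations, so it is only practical for small clusters.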
The problem with this approach is that it is not scalable -- as the number of servers (let's call it "n") rises, the number of rows you need in the table is 2 to the power n -- so for 10 servers, you need 1024 rows. Fortunately, there are formulae that can shortcut this -- the theory of combinations provides a shorter way to calculate it. Basically, in a cluster of, say, 10 servers, you can have a limited number of combinations you are interested in: i.e. no servers down, one server down, two servers down ... all the way to ten servers down. It is quite easy to calculate the probabilities of those; the problem is that there are a lot of different ways that, say 5 out of 10 servers can be down. Combinations calculates how many ways there are of doing that quite easily, so you can then find the total probability that 5 servers are down. If you're interested in the formulae for doing this, take a look at my old Physics textbook in the Resources, otherwise, the spreadsheet will do this for you. You can find the calculation on sheet 3, "Clusters".
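For reference, the combinations shortcut is the standard binomial sum. It is stated here under the assumption that all nodes have the same availability and fail independently; the symbols $a$ (single-node availability), $n$ (number of nodes) and $f$ (number of nodes that may be down while the cluster still counts as available) are introduced for this illustration.

```latex
A_{\text{cluster}} = \sum_{k=0}^{f} \binom{n}{k} \, (1 - a)^{k} \, a^{\,n-k},
\qquad
\binom{n}{k} = \frac{n!}{k!\,(n-k)!}
```

With $n = 2$ and $f = 1$ this reduces to $a^2 + 2a(1 - a)$, which is the sum of the three "available" rows in Table 3.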
3.3.1 Spreadsheet calculations of cluster availability
Spreadsheet sheet 3, "Clusters", is shown in Figures 9 and 10 below. The sheet asks you to enter the number of nodes in the cluster, the mean time to failure and mean time to recovery for an individual node, and the number of nodes that can fail while still considering the overall cluster to be "available" (see Redundant Capacity and Availability below for more discussion on this point).
As in Spreadsheet Calculations for Availability of Stacks, above, you may sometimes want to specify an availability figure for the individual nodes rather than the mean time to failure and mean time to recovery information. If so, you can enter this in the "Override calculated single node availability %" cell, as shown in Figure 10 below.
The spreadsheet then calculates two figures for you: first, the overall cluster availability is shown in the "Cluster Availability" cell. Second, the "Capacity redundant in normal operations" cell indicates the percentage of cluster capacity that is redundant when all servers in the cluster are operational. This is simply the capacity of the number of servers that are allowed to fail and still consider the cluster available, expressed as a percentage of the total. This is discussed further in Redundant capacity and availability below.
Figure 9. Cluster availability calculation on sheet 3: "Clusters"
Figure 10. Cluster availability calculation on sheet 3: "Clusters", using override of calculated single node availability
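The inputs and outputs of the "Clusters" sheet can be mimicked roughly as follows. This is a sketch under stated assumptions: single node availability is taken to be MTTF / (MTTF + MTTR), the usual definition, node failures are assumed independent, and all function and parameter names are hypothetical rather than taken from the spreadsheet.

```python
from math import comb

def node_availability(mttf_hours, mttr_hours):
    """Fraction of time a node is up, from mean time to failure and to recovery."""
    return mttf_hours / (mttf_hours + mttr_hours)

def cluster_sheet(nodes, allowed_failures, mttf_hours, mttr_hours,
                  override_availability_pct=None):
    """Return (cluster availability, redundant capacity %) for one cluster."""
    if override_availability_pct is not None:
        # Mirrors the "Override calculated single node availability %" input.
        a = override_availability_pct / 100.0
    else:
        a = node_availability(mttf_hours, mttr_hours)
    q = 1.0 - a
    availability = sum(comb(nodes, k) * q**k * a ** (nodes - k)
                       for k in range(allowed_failures + 1))
    # Capacity of the servers allowed to fail, as a percentage of the total.
    redundant_pct = 100.0 * allowed_failures / nodes
    return availability, redundant_pct

# Illustrative inputs only: 10 nodes, 2 may fail, MTTF 999 hours, MTTR 1 hour.
availability, redundant = cluster_sheet(10, 2, mttf_hours=999.0, mttr_hours=1.0)
print(f"cluster availability {availability:.8%}, redundant capacity {redundant:.1f}%")
```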
3.3.2 Redundant capacity and availability
Redundant capacity becomes relevant to availability calculations when you consider how to define when a cluster is and is not "available". For example, in a cluster of 10 servers, is it "available" when 1 server is down? Or when 2 are down? Or when 9 are down?
This question is tied intimately to the amount of redundant capacity you are prepared to accept in normal operations. If, for example, you define "available" as 8 out of 10 servers, you're accepting that in normal operations 2 servers are redundant -- i.e. 20% of your overall capacity. Spreadsheet sheet 5, "Capacity & Availability", will help you make these calculations. The basic point is that before you can design an infrastructure for a specific level of availability, you need to decide how much redundancy you're willing to pay for; only then can you specify how many servers you are prepared to tolerate going down while still calling the cluster "available".
Sheet 5, "Capacity & Availability", contains two calculations relating redundant capacity and availability. The first of these, "Variant 1", is shown in Figure 11 below.
The "Variant 1" calculation asks you to enter the number of servers you require in order to support your operating capacity (i.e. without any redundancy). You can then specify, as a percentage, what level of additional capacity you're willing to invest in to support availability. For example, you might need 10 servers to support your workload, at a cost of $100,000. You might then be willing to spend a further $25,000, or 25%, on additional, redundant servers to provide increased availability. In this case, the calculation seems trivial, but it indicates you need to buy two and a half extra servers. The spreadsheet will round up the number of servers you need to buy and indicate the actual redundant capacity -- in this case, 3 extra servers providing 30% redundant capacity. (Note that further discussion comparing the costs of providing availability solutions against the costs of the failures against which they protect can be found in Gathering Requirements below.)
The spreadsheet also calculates the redundant capacity as a percentage of the total cluster size -- i.e. including the additional servers deployed to support availability. So while the redundant capacity in this example is 30% of the 10 servers required to support the workload, it is 23.08% of the total cluster size of 10 + 3 = 13. The two percentages are identified in the spreadsheet respectively as "Actual redundant capacity as percentage of number of servers required for normal capacity" and "Redundant capacity as percentage of total number of servers required to support availability".
Finally, it is sometimes interesting to see how these calculations work out as the number of servers varies. So, as well as performing calculations for whatever number of servers and redundant capacity you specify, the spreadsheet has a table showing how your specified redundant capacity works out when 1, 2, 3 ... 10 servers are required to support normal operating capacity.
Figure 11. First variant of availability and capacity calculation on sheet 5: "Capacity & Availability"
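The "Variant 1" arithmetic is simple enough to sketch directly. The function below is a hypothetical stand-in for the spreadsheet calculation, reproducing the worked example of 10 servers and 25% additional investment, followed by the table for 1 to 10 required servers.

```python
from math import ceil

def variant_1(servers_required, redundancy_pct):
    """Return (extra servers, % of required servers, % of total cluster size)."""
    # Round up: you cannot buy a fraction of a server.
    extra = ceil(servers_required * redundancy_pct / 100)
    pct_of_required = 100 * extra / servers_required
    pct_of_total = 100 * extra / (servers_required + extra)
    return extra, pct_of_required, pct_of_total

extra, pct_required, pct_total = variant_1(10, 25)
print(extra, f"{pct_required:.2f}%", f"{pct_total:.2f}%")  # 3 30.00% 23.08%

# How the same 25% works out when 1 to 10 servers are needed for normal capacity.
for n in range(1, 11):
    print(n, variant_1(n, 25))
```

Because of the rounding, small clusters end up with proportionally more redundancy than requested: one required server with 25% requested redundancy still needs one whole extra server.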
The second calculation, "Variant 2", is shown in Figure 12, below. This is a simple calculation that assumes you know the cluster size and the number of servers you are willing to allow to fail while still regarding the cluster as a whole as "available". In this case, the calculation tells you what percentage of your overall cluster is effectively redundant in normal running. Again, a table is provided to illustrate how the calculation changes as cluster size changes. In this case, the table shows the redundant capacity for clusters of size 1 to 10 when the specified number of servers are allowed to fail.
Figure 12. Second variant of availability and capacity calculation on sheet 5: "Capacity & Availability"
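A sketch of the "Variant 2" calculation follows. The function name is hypothetical, and the decision to reject clusters that would have no server left running is an assumption of this sketch; the spreadsheet may treat that case differently.

```python
def variant_2(cluster_size, allowed_failures):
    """Percentage of the cluster that is redundant in normal running."""
    if not 0 <= allowed_failures < cluster_size:
        # Assumption of this sketch: at least one server must remain available.
        raise ValueError("allowed_failures must be between 0 and cluster_size - 1")
    return 100 * allowed_failures / cluster_size

print(variant_2(10, 2))  # 20.0, matching the "8 out of 10" example

# Redundant capacity for each cluster size up to 10 when 2 servers may fail.
allowed = 2
for size in range(allowed + 1, 11):
    print(size, f"{variant_2(size, allowed):.2f}%")
```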
3.3.3 Server clusters with shared high availability disks
It is relatively common for server clusters to share the same high availability disks in some situations, specifically those in which not only the server process must be highly available, but dynamic data required by that process must also be available. Examples include:
- Asynchronous messaging.
- Transaction managers (e.g. the WebSphere Application Server Deployment Manager).
In some cases, such as High Availability UNIX hardware, the combination of two or more servers plus shared disks is treated as a single component. In other cases, though, a cluster of individual servers might have specific directories in their file systems mapped to a high availability disk array to provide all servers in the cluster with high availability access to, for example, in-flight transaction logs and message queues.
In that case, the availability of the cluster's capability to support data and process is a combination of the server cluster availability (as calculated above, based on a number of servers that may be available or unavailable separately) and the availability of the high availability disk array on which the entire cluster depends -- i.e. this is now a "stack" calculation. The stack is shown in Figure 13, below, and its availability is calculated as the product of the two: server cluster availability x disk array availability.
Figure 13. Server cluster supported by high availability disk array
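As a minimal sketch of this "stack" calculation, the availabilities of components that must all be working at once are multiplied together. The function name and the two figures used below are purely illustrative assumptions, not values from the spreadsheet.

```python
from math import prod

def stack_availability(*component_availabilities):
    """Availability of a stack in which every component must be available."""
    return prod(component_availabilities)

cluster = 0.999999   # illustrative server cluster availability
disk_array = 0.9999  # hypothetical disk array availability
print(f"{stack_availability(cluster, disk_array):.10f}")
```

Note that the result can never be better than the weakest component: here the disk array, not the server cluster, dominates the overall figure.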
3.4 Extending the calculation to multiple physical sites
As indicated above, the techniques described here can be used as the basis for availability calculations when two or more physical sites are involved. While some of these situations are beyond the scope of the current discussion, some examples are described in the subsections below.
3.4.1 Single environment split between multiple sites with different purposes
Consider the case where a new Web application is to be deployed in an environment, but needs to link to a legacy application deployed in a separate physical location. In this case, both physical locations must simultaneously be available for the overall solution to be available. The calculation technique is the same as that for a single-site solution; the only difference is that the availability of the second physical site must be included in the same way as that of the first.
3.4.2 Single environment split across multiple identical sites for availability
In some situations that require exceptionally high availability, the dependency on a single physical site might be unacceptable; i.e. there is a requirement to support continuous operation in the event that an entire physical site becomes unavailable, due, for example, to power supply failure or a natural catastrophe.
In this case, the two or more physical sites become a cluster; i.e. once the availability of a single site has been calculated, the availability of the two combined can be calculated as in Calculating the availability of clusters, above. Additional elements to take into account are the Internet service providers giving network access to the two sites, whatever failover technologies are used to switch between them, and whatever disk sharing or replication technologies support the availability of data between the two sites.
This topic is beyond the scope of the current discussion. However, if you are considering a solution with such stringent availability requirements, the significant cost and complexity involved indicate that expert skills should be employed in its design and implementation.
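Purely as a first approximation, and with the caveat above firmly in mind, two identical sites can be treated as a two-node cluster combined with whatever shared elements both depend on. The sketch below assumes that the sites fail independently, which is a strong assumption for events such as natural catastrophes; the function and parameter names are hypothetical.

```python
def site_pair_availability(site_availability, shared_availabilities=()):
    """Two identical sites, either of which can carry the service alone.

    shared_availabilities lists elements that both sites depend on, such as
    the failover mechanism. Independence of failures is assumed throughout.
    """
    both_sites_down = (1.0 - site_availability) ** 2
    availability = 1.0 - both_sites_down
    for shared in shared_availabilities:
        availability *= shared  # shared elements form a stack with the pair
    return availability

# Hypothetical figures: each site 99% available, failover mechanism 99.9%.
print(f"{site_pair_availability(0.99, [0.999]):.7f}")
```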
3.4.3 Single environment duplicated at a backup site for disaster recovery
This is similar to the previous situation in many ways, but the intent here is rather different: to provide a backup site that can be brought online within a specified recovery period to reinstate processing capability more quickly than can be achieved by repairing or rebuilding the main site. An implication is that the recovery time is rather longer, and that a different set of techniques for failover (possibly manual in this case) and data replication (perhaps by periodic transfer of information on tape) is used.
While the techniques discussed here could be used to analyze Disaster Recovery scenarios, it is more typical to characterize them by recovery times rather than availability. Additionally, disaster recovery is a complex topic beyond the scope of this discussion, and should be addressed by experts.
4. Designing availability solutions in the end-to-end project lifecycle
4.1 Planning for availability design
Planning for availability proceeds through several stages. First, data is collected to act as input to the analysis. Once the inputs are available, analysis is performed to determine whether the planned availability levels are achievable. With the analysis complete, the enterprise environment is evaluated to determine whether current capacity and load requirements meet the planned availability levels. In summary, planning follows a four-stage process:
- Collect the inputs.
- Analyze the data.
- Verify the analysis makes sense for the input provided.
- Evaluate the enterprise environment.
4.2 Gathering requirements
As described in The basics of capacity, redundancy and availability section, availability requirements depend on several factors:
- The normal workloads which must be supported (including peaks).
- The acceptable level of planned unavailability (e.g. scheduled outages for upgrade).
- The acceptable level of unplanned unavailability (e.g. how often can outages be tolerated for which applications, and for how long).
- Isolation requirements between applications (e.g. if application A is taken out of service, either unexpectedly or for a planned change, is it acceptable that application B is also taken out of service?).
- What is the cost of service outages, to be weighed against the expense of implementing a highly available infrastructure?
Unfortunately, it is something of a cliché in systems implementation that these requirements, which are a subset of what are often called "Non-Functional Requirements", are fundamentally business requirements, and are rarely well specified by businesses implementing IT systems. Equally, they are rather technical in nature and not commonly understood by the business owners of most IT systems. It is all too common for systems to be built based on assumptions regarding availability, leading inevitably either to inappropriate expense, or applications that perform worse than desired.
A full discussion of the analysis of such requirements is beyond the scope of this paper; however, it is clear that these requirements can only be confirmed by someone in a position of authority in the business owning the applications to be deployed -- it is these people who can state with certainty what is and what is not acceptable, what the costs of outages are, and what spending is justified to reduce their frequency. As the focus of such people is rarely on these topics, it is usually incumbent on a senior technical professional to extract the information from them.
The second through fifth requirements listed above are essentially defined by asking questions of the business owners of applications. As the answers are fundamental inputs to the design of the infrastructure (and may also affect aspects of application design such as what state is stored in HTTPSession and what is stored in a database), they need to be answered early in the project lifecycle. If concrete answers are not available by the time the infrastructure design begins (or, worse, implementation), this should be raised as a fundamental risk to project success.
The first requirement, the normal workload, is a little more complicated; while answering it does require questioning the business, the business may not readily know the answer in any great detail. Some of the following techniques may be required:
- Analysis of logs of existing systems.
- Use of business intelligence or management reporting systems.
- Interviewing staff, e.g. call center operators.
- Estimations based on overall business figures (e.g. number of mortgage products sold per year, number of working days in the year etc.).
If none of those are feasible (although some form of estimation should always be possible), it might be worthwhile implementing a technique to measure existing systems as part of requirements analysis for any new system -- e.g. can any measurement or logging system be put in place to record transaction rates in the existing systems?
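An estimate based on overall business figures can be as simple as the following sketch. Every number here is invented for illustration, and the peak-to-average ratio in particular is an assumption that would need to be checked against logs or interviews.

```python
def estimate_hourly_rates(transactions_per_year, working_days_per_year,
                          working_hours_per_day, peak_to_average_ratio):
    """Return (average, peak) transactions per hour from yearly business figures."""
    working_hours = working_days_per_year * working_hours_per_day
    average = transactions_per_year / working_hours
    return average, average * peak_to_average_ratio

# Hypothetical business: 500,000 transactions a year over 250 eight-hour days,
# with the busiest hour assumed to be three times the average.
average, peak = estimate_hourly_rates(500_000, 250, 8, 3)
print(f"average {average:.0f}/hour, peak {peak:.0f}/hour")  # average 250/hour, peak 750/hour
```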
Once the workload requirements are assessed, using them to calculate the required server capacity (on which designs for isolation and redundancy are also based) is also not straightforward. While various benchmarks exist for what workload can be supported by what servers on which platforms, overall this area is highly application-dependent. There is really no substitute here for testing. Hence, characterizing workload requirements will often follow an iterative cycle:
- Establish base workload requirements from business requirements.
- Use available benchmarks to provide a first estimate of capacity.
- Implement an early, end-to-end technical prototype, test its performance and throughput, and refine capacity estimates.
- Design and implement the initial infrastructure, with a backup plan for increasing capacity if necessary.
- Repeatedly test any available releases of the application proper and, if necessary, amend the planned infrastructure capacity.
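The first capacity estimate in this cycle might be sketched as below, to be refined as prototype and release testing replace benchmark figures with measured ones. The function name, the throughput figures and the target utilization (headroom) parameter are all assumptions of this sketch.

```python
from math import ceil

def servers_required(peak_rate, per_server_rate, target_utilization=0.7):
    """Servers needed so that peak load keeps each server at or below the target.

    per_server_rate comes from a benchmark at first and from testing later;
    target_utilization is a hypothetical headroom setting.
    """
    return ceil(peak_rate / (per_server_rate * target_utilization))

first_estimate = servers_required(peak_rate=750, per_server_rate=100)   # benchmark figure
after_prototype = servers_required(peak_rate=750, per_server_rate=60)   # measured figure
print(first_estimate, after_prototype)
```

This gives the capacity for normal operations only; redundant capacity for availability is added on top of it.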
Finally, note that the discussion to this point has implicitly assumed that applications are monolithic, i.e. there is a single set of workload and availability requirements for the entire application. In fact, this is often not the case, and it might be that measures to provide high availability are only necessary for a subset of the application -- if so, this might provide a means to implement the required availability at a significantly lower cost. For example:
- Try analyzing the requirements by use case; some use cases (e.g. make payments from an online banking account) might require higher availability than others (e.g. order travel insurance brochure).
- Try analyzing the requirements end-to-end through the system architecture (e.g. if payments are captured by a Web application, then placed on a queue for asynchronous processing, it might be that only the Web application and input queues need to be highly available).
4.2.1 The cost of failures and the cost of redundancy
As noted earlier, providing redundant capacity to increase availability comes at a cost that must be offset against the costs of the failures that increased availability prevents. It should be noted that this cost is usually significantly higher than that of simply acquiring additional servers, for instance:
- Providing high availability for database servers involves making the hardware itself more highly available, rather than providing additional redundant hardware. This might involve sophisticated server platforms and real-time disk replication technology. The costs of acquiring, implementing and managing such technology are significantly higher than those of buying additional basic servers.
- The availability of the entire infrastructure is limited by that of the least available component. So providing server redundancy will not help unless power supplies, networks, firewalls etc. are all provided with redundant capacity.
- Defining a comprehensive high availability solution involves an increase in the number of "moving parts" in the overall topology -- for example, wherever server clusters are used for redundancy, some mechanism to switch from failed to redundant servers is required. This mechanism must be acquired, deployed in a manner that is itself highly available, and managed.
Typically, all of the above costs increase more and more rapidly as ever higher levels of availability are required -- this should be expected, as no amount of money will ever guarantee 100% availability. It is therefore important to invest the right amount of money in the right availability solution.
As it is not possible to cost a solution without designing it, a good starting point is to calculate the hourly cost of down-time (i.e. how much business or productivity will I lose forever if my application is unavailable for 1 hour? Is that always true, or is the cost less for a regular, planned outage than for a surprise unplanned outage? Are the costs different at different times of the working day or week?). In practice, you will likely need to derive several figures, for example:
- The cost of an unplanned outage of up to 2 hours is "a" thousand dollars per hour.
- The cost of an unplanned outage of over 2 hours but less than one day is "b" thousand dollars per hour.
- The cost of a planned outage of up to 6 hours overnight is "c" thousand dollars.
- The cost of a planned outage during working hours where users are given at least 24 hours notice is "d" thousand dollars per hour ... and so on.
Additionally, there are some slightly more subtle figures you may need to determine. For example, if you hold "shopping cart" information in memory on your application servers, then the failure of any server will destroy a number of shopping carts unless you take steps to provide persistent storage. So, for any such "stateful" transaction data, you need to understand its value, in the sense of the cost to the business when the data is lost.
Once you understand those basic costs of outages, you next need to determine the acceptable total yearly cost of downtime -- i.e. how much business or productivity can you afford to lose each year? As the costs of each type of downtime, as described above, vary, you may need to separately understand the acceptable total cost of each type. Once you have this information, you have the basic parameters within which you need to design an availability solution.
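To make these parameters concrete, the sketch below converts an availability figure into expected hours of downtime per year and prices a set of unplanned outages using two hourly rates. The rates, the threshold handling and all names are hypothetical; `short_rate` and `long_rate` stand in for the first two per-hour figures in the list above.

```python
HOURS_PER_YEAR = 24 * 365

def expected_downtime_hours(availability):
    """Expected hours of unavailability per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR

def unplanned_outage_cost(duration_hours, short_rate, long_rate, threshold_hours=2.0):
    """Cost of one outage; the hourly rate depends on how long the outage lasts."""
    rate = short_rate if duration_hours <= threshold_hours else long_rate
    return duration_hours * rate

def yearly_unplanned_cost(outage_durations, short_rate, long_rate):
    """Total cost of a year's unplanned outages, given their durations in hours."""
    return sum(unplanned_outage_cost(d, short_rate, long_rate) for d in outage_durations)

print(f"{expected_downtime_hours(0.999):.2f} hours/year at 99.9% availability")
# Hypothetical year: one 1.5 hour outage and one 4 hour outage,
# costed at 10 and 25 (thousand dollars per hour) respectively.
print(yearly_unplanned_cost([1.5, 4.0], short_rate=10.0, long_rate=25.0))
```

Comparing the result with the acceptable total yearly cost of downtime indicates whether a candidate design is within budget.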
4.2.2 Varying workloads
As discussed in previous sections, the starting point for designing appropriate availability solutions is knowledge of the required workload in normal operating conditions. However, it is rare that the workload requirement can be expressed as a simple number, for example:
- Many applications exhibit strong peaks in activity throughout the day (e.g. start of working day, lunchtime, early evening when people return home from work etc.).
- Many applications exhibit peaks through the week (e.g. on Monday morning as the working week starts, Saturday morning as the weekend begins etc.) or are primarily active on particular days (e.g. Web sites relating to television programs or sporting events, applications used during the working week).
- Some applications may exhibit variations through the year, or associated with particular events (e.g. Web sites related to sporting seasons or travel, etc.).
While one approach is to simply provide enough capacity to run whatever the maximum peak load is, this will likely result in inefficient use of server capacity, particularly if multiple applications are deployed this way in their own infrastructure. Figures 14 and 15 below illustrate the varying hourly workload for three fictitious applications, and the resulting redundant capacity when they are deployed separately or together.
Figure 14. Varying Hourly Workload for Three Applications
Figure 15. Redundant Capacity in Individual vs. Shared Infrastructure
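The effect illustrated in Figures 14 and 15 can be reproduced with a small calculation: separate infrastructures must each be sized for their own peak, whereas a shared infrastructure is sized for the peak of the combined load. The workload figures below are invented and are not the data behind the figures.

```python
# Hypothetical load per time slot, in units of one server's capacity.
workloads = {
    "application A": [2, 8, 3, 1],
    "application B": [1, 2, 9, 2],
    "application C": [7, 1, 1, 3],
}

def capacity_separate(loads):
    """Capacity needed when each application is sized for its own peak."""
    return sum(max(series) for series in loads.values())

def capacity_shared(loads):
    """Capacity needed when all applications share one infrastructure."""
    return max(sum(slot) for slot in zip(*loads.values()))

print(capacity_separate(workloads), capacity_shared(workloads))  # 24 13
```

The saving arises only because the three peaks fall in different time slots; applications that peak together gain nothing from sharing.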
Assessing varying workloads to this level of detail is unlikely to be easy, but the techniques are effectively the same as discussed in the Resources section.
4.2.3 Analyzing requirements
Using the above information, you should be able to define a set of applications, or partial applications, each of which can be characterized by:
- Acceptable planned unavailability
- Acceptable unplanned unavailability
- Isolation requirements.
First, the isolation and availability requirements enable you to identify which applications can be co-hosted on the same physical servers, or in the same server processes / application servers, as in the section on availability, capacity and redundancy, above. Hence, you can group the applications into application groups that can be co-deployed, based on:
- similar availability requirements.
- compatible isolation requirements.
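One possible way to form such groups is sketched below. It is a simple greedy pass, not a prescribed method: applications are placed in the first existing group that has the same availability requirement and no isolation conflict. The `App` structure, the tier labels and the application names are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class App:
    name: str
    availability_tier: str                           # e.g. "high" or "standard"
    isolate_from: set = field(default_factory=set)   # apps it must not share servers with

def group_for_codeployment(apps):
    """Greedily group apps with the same tier and no isolation conflicts."""
    groups = []
    for app in apps:
        for group in groups:
            same_tier = group[0].availability_tier == app.availability_tier
            conflict = any(app.name in other.isolate_from or other.name in app.isolate_from
                           for other in group)
            if same_tier and not conflict:
                group.append(app)
                break
        else:
            groups.append([app])  # no compatible group found, so start a new one
    return groups

apps = [
    App("payments", "high", {"reports"}),
    App("quotes", "high"),
    App("reports", "high"),
    App("brochure", "standard"),
]
for group in group_for_codeployment(apps):
    print([app.name for app in group])
```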
These requirements also allow you to specify the nature of the availability solution to be used for each group of applications:
- Some applications may be able to tolerate relatively long (e.g. 1 hour) unplanned outages during normal operations, and may be suitable for "cold standby" availability solutions where the entire infrastructure is restarted (or replaced by a separate standby infrastructure). Other applications may not be able to tolerate such outages, and will need to be provided with "hot standby" infrastructure, where failing components are automatically replaced by backups that are already running.
- Some applications may be able to tolerate relatively frequent planned outages of several hours in order to allow the application of upgrades. Upgrade processes can therefore be relatively straightforward as the infrastructure can be brought down to perform them. Other applications may not be able to tolerate such outages, and more complex upgrade processes will be required involving taking individual servers down one at a time to perform upgrades before bringing them back online, all the while keeping the overall infrastructure available.
The next step is to identify the total workload requirement associated with each group. This, combined with the results of testing or benchmarks, will tell you the server capacity required to support normal operations. Finally, you need to decide on the amount of redundant capacity to provide each group to support availability. The spreadsheet associated with this document, and the accompanying section Calculating availability, have been provided to help you to do that. For information concerning the techniques used to support failover and workload management for the redundant capacity, refer to the Resources section.
4.3 Testing the availability solution
The topologies of distributed, high availability environments are complex and contain a large number of "moving parts", or components (switches, replicated disks, monitoring technologies) that are required to actively function to support availability. It may seem obvious, but it is important to comprehensively test these components in order to verify that when a failure does occur, they do their job and maintain availability.
So, in addition to operational and workload testing, high availability environments, and the applications that are deployed in them, should undergo some form of availability testing. However, switching the environment on and waiting for failures to occur is not going to be an efficient means of testing that the various redundant components and failover technologies work. Instead, some means of triggering or simulating failures is required.
So, a comprehensive availability test plan would include, for each component:
- What failures might affect that component (e.g. process failure, hardware failure, network failure)?
- How can each failure be triggered / simulated (e.g. deploying code with a deliberate bug, using a debugger to hang the application)?
- Which person or process has the job of identifying when failures occur? (e.g. is it automatically detected by a heartbeat with a cloned component? Or does systems management software do the job? Or is it a more manual process involving operators?)
- What process or action (if any) should be taken to handle each failure (e.g. identify the workload management component that should switch processing to backup components, identify the manual failover process, or identify the script that should be used to restart the failed component etc.)?
- What test can be applied to verify that the failure was handled correctly? Note that this should include both successfully handling future processing, and successfully handling any processing that was ongoing at the time of failure -- e.g. were in-flight transactions successfully picked up and processed by a backup server?
This information will provide the basis of a set of test cases that can be used to test the high availability features of the infrastructure.
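The questions above map naturally onto a record per test case, which makes it easy to spot cases that have not yet been fully thought through. The structure below is one hypothetical way to capture them; the field names and the example entry are illustrative only.

```python
from dataclasses import dataclass, fields

@dataclass
class AvailabilityTestCase:
    component: str     # the component under test
    failure: str       # what failure might affect it
    trigger: str       # how the failure is triggered or simulated
    detected_by: str   # the person or process that identifies the failure
    handling: str      # the process or action taken to handle it
    verification: str  # how correct handling is verified, including in-flight work

def unanswered_questions(case):
    """Names of the fields that are still blank for this test case."""
    return [f.name for f in fields(case) if not getattr(case, f.name).strip()]

case = AvailabilityTestCase(
    component="application server",
    failure="process failure",
    trigger="use a debugger to hang the application",
    detected_by="heartbeat with a cloned component",
    handling="",  # not yet decided
    verification="in-flight transactions are picked up by a backup server",
)
print(unanswered_questions(case))  # ['handling']
```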
Availability is an achievable service level characteristic that every enterprise grapples with. The worst case scenario is where load is underestimated or hardware/network-bandwidth is swamped because careful planning was not conducted. Avoiding the worst case scenario can be accomplished by following the steps outlined in this paper and using the attached Excel spreadsheet as part of the availability planning exercise in the enterprise.
As detailed in this article, the level of comprehension of core concepts and extensive work required for availability planning is not trivial; however, it is well within the capabilities of most IT architects.
|Code sample||AvailabilityCalculator.zip ( HTTP | FTP )||24 KB|
- Advanced Clustering Techniques for Maximizing Web Site Availability with WebSphere Application Server, Version 5
- Server Clusters For High Availability in WebSphere Application Server Network Deployment Edition 5.0
- Many mathematics textbooks describe probability calculations and the use of "combinations" as described here, including Mathematical Methods in the Physical Sciences by Mary L. Boas (Wiley 1983)