Web services often suffer from long response times or temporary non-availability. For some classes of Web-based applications, such as online processing of medical images, this kind of behavior just isn't acceptable. In such applications, you need a service that is highly available, processes HTTP requests fast, and allows built-in applications to use the power of high-performance machines. One solution is to make Web services fault tolerant.
Fault tolerance and dependability
Fault tolerance is closely tied to dependability, a concept that encapsulates the following attributes:
- Availability means that a system is immediately ready for use. In general, it refers to the probability that the system is operating correctly at any given moment and is available to perform its functions on behalf of its users. In other words, a highly available system is the one that works most of the time.
- Reliability indicates that a system can run continuously without failure. In contrast to availability, reliability is defined in terms of a time interval instead of an instant in time. A highly reliable system is the one that continues to work without any interruption over a relatively long period of time. This is a subtle, yet important, difference when compared to availability.
- Safety refers to the situation in which a system temporarily fails to operate correctly, yet nothing catastrophic happens. In the failed state the system does not perform its function properly, but it also does not cause any major disruption. For example, process control systems, such as those used for controlling nuclear power plants and manned space vehicles, are required to provide a high degree of safety. If such control systems fail, even for a brief moment, the effects could be disastrous.
- Maintainability refers to how easily a failed system can be repaired. A highly maintainable system might also show a high degree of availability, especially if failures can be detected and repaired automatically.
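The difference between availability and reliability can be made concrete with a small calculation. The sketch below is not part of the original discussion. It assumes the textbook steady-state formula availability = MTTF / (MTTF + MTTR) and a constant failure rate, so that the reliability over an interval t is exp(-t / MTTF). The function names and the sample numbers are illustrative only.

```python
import math

def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)

def reliability(interval_hours: float, mttf_hours: float) -> float:
    """Probability of running the whole interval without any failure,
    assuming a constant failure rate (exponential model)."""
    return math.exp(-interval_hours / mttf_hours)

# Hypothetical system: fails about once an hour, recovers in one millisecond.
mttf = 1.0                 # mean time to failure, in hours
mttr = 0.001 / 3600        # mean time to repair, in hours

print(f"availability:          {availability(mttf, mttr):.7f}")
print(f"reliability over 24 h: {reliability(24.0, mttf):.3e}")
```

With these numbers the system is available well over 99.9999 percent of the time, yet it is almost certain to fail at least once during any 24-hour period: highly available, but not highly reliable.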
Fault tolerance is ultimately about keeping the full availability of the system, even in the presence of failure. When designing a dependable fault-tolerant system, it is important to understand the definitions of:
- Failure. A failure occurs when the delivered service of a system or a component deviates from its specification.
- Error. An error is a part of the system state that is liable to lead it to failure. An error affecting the service is an indication that a failure is occurring, or has occurred.
- Fault. A fault is the adjudged or hypothesized cause of an error.
While an error is the manifestation of a fault in the system, a failure is the effect of an error on the service. Therefore, faults are potential sources of system failures. As mentioned above, fault tolerance is the ability to continue to provide service in spite of faults. It can be achieved by error processing and fault treatment. The goal of error processing is to remove errors from the computational state before a failure occurs, whereas the goal of fault treatment is to prevent faults from being activated again.
Why is redundancy a required factor in a fault-tolerant system?
It is impossible to implement a fault-tolerant system without some form of redundancy, that is, the use of additional information, resources, or time beyond what is needed for normal system operation. There are two types of redundancy: time and space. Time redundancy is based on using extra execution time, whereas space redundancy is based on using extra physical resources, such as extra processors, memory, disks, or communication links. A classic example of space redundancy is process replication. Fault tolerance requires some form of redundancy, and redundancy requires some form of replication. Therefore, providing fault tolerance requires replication. There are two well-known process replication schemes, active and passive, both of which are fault-tolerant techniques.
Techniques for fault tolerance
As mentioned above, there are two well-known techniques for fault tolerance in a distributed system: active replication and passive replication.
- Active replication
- Active replication consumes a lot of resources. In the context of Web services, this means it creates the need for redundant application servers. When the fault-tolerant system receives a request from a client, the request is sent to all of the server replicas. Each replica computes the result independently and sends it to a component called the "voter". The voter selects the result that occurs most often among the replies and returns it to the client. Figure 1 below is a pictorial representation of this process.
Figure 1. Active replication with voter
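The voting step of active replication can be sketched in a few lines. This is a minimal illustration, not an implementation taken from any particular system: replicas are modeled as plain Python callables, and the names `active_replicate`, `correct`, `faulty`, and `crashed` are hypothetical. It also assumes that a replica that crashes simply contributes no vote.

```python
from collections import Counter

def active_replicate(request, replicas):
    """Send the request to every replica and return the most common result."""
    results = []
    for replica in replicas:
        try:
            results.append(replica(request))
        except Exception:
            # Assumption: a crashed replica simply contributes no vote.
            continue
    if not results:
        raise RuntimeError("all replicas failed")
    # The voter: pick the result with the maximum number of occurrences.
    winner, _count = Counter(results).most_common(1)[0]
    return winner

def correct(x):
    return x * 2

def faulty(x):
    return x * 2 + 1   # a replica that computes a wrong value

def crashed(x):
    raise ConnectionError("replica unreachable")

answer = active_replicate(21, [correct, faulty, correct, crashed])
print(answer)  # prints 42
```

Two of the three replies agree, so the wrong value from the faulty replica is outvoted and the crashed replica is ignored.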

- Passive replication -- Primary backup
- The essential idea of the primary backup replication technique is that at any given instant one server acts as the primary server and does all the work. If the primary server fails, the backup server takes over. Passive replication has two variants: passive replication with state update and passive replication without state update. In the "with state update" scheme, the primary and backup servers keep their states synchronized. Figure 2 shows a simple primary backup system with state update.
Figure 2. Simple primary backup system with state update

As shown in Figure 2, the client sends a request to the primary server (1). The primary server does its work (2) and updates that work on the backup server (3). If the backup server updates its work successfully (4), an acknowledgement (5) will be sent to the primary server. After receiving the acknowledgement, the primary server will compare (6) and send a reply (7) to the client. If the primary server goes down, all requests will be forwarded to the backup server. Since both maintain the same state, there is no need to update the state of the backup server.
In the passive replication without state update scheme, the primary server does not keep the backup up to date. The backup therefore has to do a warm start, unless the service is stateless. When the primary server goes down, the backup server comes into action.
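The primary backup scheme with state update can be sketched as follows. This is a simplified, single-process model under stated assumptions: `Replica` and `PrimaryBackup` are hypothetical names, the "state" is just a dictionary, and failure is simulated by an `alive` flag. The step numbers in the comments refer to Figure 2.

```python
class Replica:
    """A server that stores key/value state and can be marked as failed."""

    def __init__(self, name):
        self.name = name
        self.state = {}
        self.alive = True

    def apply(self, key, value):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        self.state[key] = value
        return True  # acknowledgement

class PrimaryBackup:
    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup

    def handle(self, key, value):
        if not self.primary.alive:
            # Failover: the backup already holds the same state, so it
            # takes over without any state transfer.
            self.primary, self.backup = self.backup, self.primary
        self.primary.apply(key, value)        # (2) primary does the work
        if self.backup.alive:
            self.backup.apply(key, value)     # (3)-(5) update backup, get ack
        return f"{self.primary.name} stored {key}"   # (7) reply to client

a, b = Replica("A"), Replica("B")
pair = PrimaryBackup(a, b)
first = pair.handle("x", 1)     # served by A, mirrored to B
a.alive = False                 # simulate a primary failure
second = pair.handle("y", 2)    # served by B, which already knows "x"
print(first, "|", second, "|", b.state)
```

Because every update reaches the backup before the reply is sent, server B can answer the second request with the full state. In the variant without state update, B would start with an empty dictionary.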
Why is a fault-tolerant system required for Web services?
Web services often suffer from long response times or temporary non-availability. For certain Web-based applications, such as online banking, stock trading, reservation processing, and shopping, such outages are unacceptable. Fault-tolerant techniques can be used to address these problems; however, keep in mind that the concept of fault tolerance is entirely different from Reliable Messaging (RM) in Web services. In RM, both client and server are aware of the protocol and act upon it. In a fault-tolerant situation, the client might not know what is actually happening with the application server; at most, it might be aware that the server side has implemented fault-tolerant functionality.
Almost all Web service deployments are based on Simple Object Access Protocol (SOAP) messages (a few are based on REST (Representational State Transfer)), yet SOAP-based Web services lack fault-tolerant capability. There are numerous fault-tolerant systems for real-time applications, but finding a fault-tolerant system for Web services is not an easy task. The major problems faced by clients who use SOAP to access a Web service are service non-availability and request resending. When a client sends a request and the server fails while the request is being processed, the client is disconnected and does not receive the expected response at all. Instead, the client receives either a connection timeout exception or a socket closed exception. The client then has to resend the SOAP request to access the service. Such undesirable situations can be prevented by using a fault-tolerant system for SOAP-based Web services.
Client-transparent and non-transparent fault tolerance
Fault tolerance for Web services (FAWS) distinguishes two techniques: client-transparent fault tolerance and non-transparent fault tolerance. In the case of client-transparent fault tolerance, the client sends the request and waits for the response -- the failover logic is handled on the server side, as shown in Figure 3. In the case of non-transparent fault tolerance, the client has a list of service providers and sends the request to the first one. If the client receives the desired response, then everything is fine. But if the client gets a SOAP fault or a transport-level fault (such as HTTP 500), it sends the request to the next service provider on the list, and so on until the desired response is obtained, as shown in Figure 4 below.
Figure 3. Client-transparent fault tolerance

- The client sends the request to the published EPR (endpoint reference) for a given service provider.
- The application server (having FT functionality) forwards it to one of its application servers.
- That application server returns a fault (can be SOAP fault or transport fault).
- The same request is forwarded to another application server.
- It gives the desired output.
- It then sends the response to the client.
Figure 4. Non-transparent fault tolerance

- The client sends a request to the first EPR on the list.
- The client gets an error from that EPR.
- It sends the same request to a different EPR on the list.
- It then gets the desired response.
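The non-transparent steps above amount to a retry loop on the client side. The following sketch shows the idea under stated assumptions: each EPR is modeled as a hypothetical callable rather than a real SOAP endpoint, and `ServiceFault` is an invented exception that stands in for a SOAP fault or a transport-level fault such as HTTP 500.

```python
class ServiceFault(Exception):
    """Stand-in for a SOAP fault or a transport-level fault (e.g. HTTP 500)."""

def call_with_failover(request, endpoints):
    """Try each endpoint in list order until one returns a response."""
    last_fault = None
    for send in endpoints:
        try:
            return send(request)
        except ServiceFault as fault:
            last_fault = fault   # remember the fault, then try the next EPR
    raise ServiceFault("all endpoints failed") from last_fault

def broken_provider(request):
    raise ServiceFault("HTTP 500")

def healthy_provider(request):
    return f"response to {request}"

result = call_with_failover("getQuote", [broken_provider, healthy_provider])
print(result)  # prints "response to getQuote"
```

The cost of this approach is that every client has to carry the provider list and the retry logic, which is exactly what the client-transparent variant moves to the server side.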
FAWS is based entirely on a distributed object architecture. It consists of several components that operate independently of each other in a distributed network. FAWS guarantees full availability of the whole system, even if a single component fails. This distributed architecture lowers the risk of a complete system failure. FAWS is based on passive replication without state update and provides client-transparent fault tolerance.
The front end of FAWS acts as the Web server from the client's point of view: it receives the SOAP requests. When a request is received, it is forwarded to the primary Web server, and the response from that server is sent back to the respective client. The front end logs each received request before it is sent to the primary server. This frees clients from having to resend the request in case of a primary failure. As a result, clients need only be aware of the address of the FAWS front end (which could be the published EPR) to access the Web service. They need not be concerned with the underlying primary and secondary servers.
Figure 5 illustrates the high-level component architecture of FAWS. The figure also highlights how the four major components of the system -- FT-Front, FT-Admin, FT-Detector, and FT-Monitor -- interact with each other, as well as with the primary and backup servers. Each component works independently and communicates with one or more components in the system (for example, FT-Detector communicates with FT-Admin). All of the internal communication among the components is carried out using RMI (Remote Method Invocation).
Figure 5. The component architecture of FAWS

Functionality of the main components
FT-Front: This is the only component that directly interacts with clients. The client sees FT-Front as the application server that provides the Web services. FT-Front listens for SOAP requests on a specific port, which is configurable using FT-Admin. FT-Front uses RMI to communicate with FT-Admin. FT-Admin starts up FT-Front by providing configuration data, such as the IP address of the primary application server, the maximum number of resends per request (resending is possible because FT-Front maintains a message log), and so forth. FT-Front can fail over to a new primary server dynamically when it is notified by FT-Admin.
FT-Admin: This component can be considered the core of the system. It consists of two subcomponents: the administration subcomponent and the monitoring subcomponent (FT-Monitor). FT-Admin manages the fault-tolerant system as a single system, and its applications as if they were running on a single server. FT-Admin communicates constantly with FT-Detector and FT-Front in order to provide uninterrupted service. FT-Admin provides two services: replication management and configuration management. Replication management is responsible for maintaining replicated servers and for failover operations, whereas configuration management is responsible for system initialization and configuration changes.
FT-Detector: This is the fault-detecting component of the system. It detects software failures (process failures) and hardware failures (machine failures) and notifies FT-Admin accordingly. Software-fault detection is done by port scanning: by checking whether a particular port is active (that is, whether an application server is up and running), it is possible to decide the status of the server. FT-Detector uses this technique to track Web server failures. Hardware or network failure is detected by using the Internet Control Message Protocol (ICMP). ICMP echo requests are sent to each machine periodically. FT-Detector waits for a certain time period to receive a reply -- the checking period can be set using FT-Monitor, and the default interval is one second. If no reply arrives, it resends the ICMP request and waits again. If it does not get a reply to the resent request either, FT-Detector decides that the particular machine has failed.
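The detection logic described for FT-Detector can be approximated as follows. This is a sketch under stated assumptions, not the FAWS code: the port check uses a plain TCP connection attempt from Python's standard `socket` module, and the function names `port_is_open` and `server_has_failed` are hypothetical. Sending real ICMP echo requests usually needs raw sockets and elevated privileges, so the "probe, then resend once" rule is written against any probe function instead.

```python
import socket

def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Software-fault check: is something accepting connections on the port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False

def server_has_failed(probe, attempts: int = 2) -> bool:
    """Declare failure only if every probe attempt goes unanswered.

    With attempts=2 this mirrors "send a request, then resend it once".
    The generator stops probing as soon as one attempt succeeds.
    """
    return not any(probe() for _ in range(attempts))

# Hypothetical usage against an application server on localhost:8080:
# failed = server_has_failed(lambda: port_is_open("127.0.0.1", 8080))
```

Requiring two consecutive misses keeps a single lost packet or a slow reply from triggering an unnecessary failover.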
FT-Monitor: FT-Admin and FT-Monitor run together as one component. FT-Monitor is the only component that gives a graphical representation of the system, such as how the components are interconnected and their status. It provides two major functions for the system administrator. First, it provides a way to set (or change) the system configuration at the initial stage and at run-time. Second, it provides the system status and a graphical representation of the distributed FAWS components.
Message processing flow inside FAWS
FAWS is capable of handling multiple clients simultaneously. The number of clients it can serve depends on the size of the incoming queue -- although FAWS is thread-safe, each accepted request still has to be processed to completion. For simplicity, I discuss only how FAWS operates with one client (see Figure 6).
Figure 6. Message processing steps of FAWS

Figure 6 illustrates communication and message exchanges among FAWS components in both normal operation conditions and faulty conditions. The numbers in the figure correspond to each step in both conditions. The following tables describe each step that corresponds to the above conditions in detail. Table 1 shows FAWS operations under normal (faultless) operations. Table 2 and Table 3 illustrate FAWS operations under faulty conditions.
| Step | Description |
|---|---|
| 1 | The client sends the SOAP request to FT-Front. |
| 2 | FT-Front accepts the client, creates a separate thread named client-thread to handle the client, and starts it. |
| 3 | The client-thread running in FT-Front logs the message. |
| 4 | The client-thread gets the IP address of the current primary server and forwards the request to it. |
| 5 | The primary server processes the request and sends back the response to the respective client-thread in FT-Front. |
| 6 | The client-thread sends the received response from the server to the client. |
| 7 | The client-thread removes the logged SOAP request and terminates. |
If any fault or failure is detected by FAWS (FT-Detector), its operation deviates a little from above. Steps 8 to 13 describe the operation of FAWS under such a faulty condition.
| Step | Description |
|---|---|
| 8 | FT-Detector detects a failure (either machine failure or Web server process failure) in the primary. |
| 9 | FT-Detector notifies FT-Admin about the failure. |
| 10 | FT-Admin selects one of the backup servers as the primary and notifies FT-Monitor to show the current status. |
| 11 | FT-Admin notifies FT-Front about the new primary server. FT-Front changes its current primary address to the new address. |
| 12 | When the primary is changed by FT-Admin, FT-Front discards the pending requests sent to the failed primary and retrieves from the message log the messages that had been sent to it. |
| 13 | FT-Front sends the acquired logged requests to the new primary. After that, the system operation is similar to the operation described in steps 5, 6 and 7, respectively. |
FT-Front is also capable of detecting primary server failures. This fault-detection mechanism in FT-Front is useful in case of an FT-Detector failure. Steps 14 and 15 describe what happens if FT-Front gets an exception while trying to access the primary Web server.
| Step | Description |
|---|---|
| 14 | Because the primary server has failed, FT-Front gets an exception while sending a received request. |
| 15 | FT-Front requests a new primary address from FT-Admin. This failure-detection mechanism in FT-Front guarantees the full operation of FAWS in case of an FT-Detector failure. After this step, the system operation is similar to the operation described in steps 10, 11, 12, and 13, respectively. |
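The steps in the three tables can be condensed into a small model of FT-Front. This is a single-threaded sketch under stated assumptions, not the actual FAWS code: servers are hypothetical callables, the `FtFront` class and its `fail_over` method are invented names, and `fail_over` stands in for the RMI exchange with FT-Admin by simply promoting the next server in a local list.

```python
class FtFront:
    """Logs each request, forwards it to the primary, replays it on failover."""

    def __init__(self, servers, max_resends=3):
        self.servers = list(servers)   # servers[0] is the current primary
        self.max_resends = max_resends
        self.log = {}                  # message log: request id -> request
        self.next_id = 0

    def fail_over(self):
        """Stand-in for obtaining a new primary from FT-Admin (steps 10-11, 15)."""
        self.servers.pop(0)
        if not self.servers:
            raise RuntimeError("no backup server left")

    def handle(self, request):
        self.next_id += 1
        request_id = self.next_id
        self.log[request_id] = request                 # step 3: log the message
        try:
            for _ in range(self.max_resends + 1):
                primary = self.servers[0]              # step 4: current primary
                try:
                    # Steps 5-6 (and 12-13 after a failover): send the
                    # logged copy and hand the response back to the client.
                    return primary(self.log[request_id])
                except ConnectionError:                # step 14: primary failed
                    self.fail_over()
            raise RuntimeError("request failed after maximum resends")
        finally:
            # Step 7: remove the logged request. In this sketch the entry
            # is also dropped when the request is finally given up on.
            del self.log[request_id]

def dead_primary(request):
    raise ConnectionError("primary down")

def backup(request):
    return f"ok: {request}"

front = FtFront([dead_primary, backup])
reply = front.handle("<soap>getQuote</soap>")
print(reply)
```

The client makes a single call and receives a single reply. The failed attempt, the failover, and the resend from the log all happen inside `handle`, which is the essence of client-transparent fault tolerance.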
Web services have become a major middleware platform, but they lack built-in support for fault tolerance. FAWS addresses this gap by keeping Web services available in the presence of failures. Because FAWS provides client-transparent fault tolerance, the client is not aware of any server failure and need not resend the request.
| Description | Name | Size | Download method |
|---|---|---|---|
| FAWS components | FAWS.zip | 1014KB | FTP |
Learn
- Learn more about FAult tolerance for Web services (FAWS):
- FT-SOAP: A Fault Tolerant Web Service
- Client-Transparent Fault-Tolerant Web Service
- Fast Transparent Failover for Reliable Web Services
- Implementation and Evaluation of Transparent Fault-Tolerant Web Service with Kernel-Level Support
- Apache Sandesha is an implementation of the Web Services ReliableMessaging (WS-ReliableMessaging) specification, published by IBM, Microsoft, BEA, and TIBCO Software as a joint specification, on top of Apache Axis.
Deepal Jayasinghe is a Software Engineer at WSO2. His current work involves architecting and developing Apache Axis2 and Apache Synapse for the Apache Software Foundation. His expertise is in distributed computing, fault-tolerant systems, and Web service-related technologies. You can contact him at deepal@apache.org.