
FAWS for SOAP-based Web services

A client-transparent fault tolerance system for SOAP-based Web services

Deepal Jayasinghe (deepal@apache.org), Software Engineer, WSO2
Deepal Jayasinghe is a Software Engineer for WSO2. His current work involves architecting and developing Apache Axis2 and Apache Synapse for the Apache Software Foundation. His expertise is in distributed computing, fault-tolerant systems, and Web service-related technologies. You can contact him at deepal@apache.org.

Summary:  Use a portable system called FAWS (FAult tolerance for Web services) to provide client-transparent fault tolerance for Simple Object Access Protocol (SOAP)-based Web services. As demand for Web services in application solutions grows and more players enter the Web service arena, the ability of Web services to guarantee full availability in the presence of failures becomes critical. However, most existing fault-tolerant systems for Web services do not transparently handle requests whose processing was in progress when the failure occurred.

Date:  31 Jan 2005
Level:  Advanced

Motivation

Web services often suffer from long response times or temporary non-availability. For some classes of Web-based applications, such as online processing of medical images, this kind of behavior just isn't acceptable. In such applications, you need a service that is highly available, processes HTTP requests fast, and allows built-in applications to use the power of high-performance machines. One solution is to make Web services fault tolerant.


Fault tolerance and dependability

Dependability: Fault tolerance is based on dependability, which is a concept that encapsulates the following attributes:

  • Availability means that a system is immediately ready for use. In general, it refers to the probability that the system is operating correctly at any given moment and is available to perform its functions on behalf of its users. In other words, a highly available system is the one that works most of the time.
  • Reliability indicates that a system can run continuously without failure. In contrast to availability, reliability is defined in terms of a time interval instead of an instant in time. A highly reliable system is the one that continues to work without any interruption over a relatively long period of time. This is a subtle, yet important, difference when compared to availability.
  • Safety refers to the probability that a system is either functioning properly or has failed in a way that causes no major disruption. In other words, when a safe system temporarily stops performing its function, nothing catastrophic happens. For example, process control systems, such as those used for controlling nuclear power plants and manned space vehicles, are required to provide a high degree of safety. If such control systems temporarily fail, even for a brief moment, the effects could be disastrous.
  • Maintainability refers to how easily a failed system can be repaired. A highly maintainable system might also show a high degree of availability, especially if failures can be detected and repaired automatically.

Fault tolerance is ultimately about keeping the full availability of the system, even in the presence of failure. When designing a dependable fault-tolerant system, it is important to understand the definitions of:

  • Failure. A failure occurs when the delivered service of a system or a component deviates from its specification.
  • Error. An error is a part of the system state that is liable to lead to a failure. An error affecting the service is an indication that a failure is occurring, or has occurred.
  • Fault. A fault is the adjudged or hypothesized cause of an error.

While an error is the manifestation of a fault in the system, a failure is the effect of an error on the service. Therefore, faults are potential sources of system failures. As mentioned above, fault tolerance is the ability to continue to provide service in spite of faults. It can be achieved by error processing and fault treatment. The goal of error processing is to remove errors from the computational state before a failure occurs, whereas the goal of fault treatment is to prevent faults from being activated again.


Why is redundancy a required factor in a fault-tolerant system?

It is impossible to implement a fault-tolerant system without some form of redundancy, that is, the use of additional information, resources, or time beyond what is needed for normal system operation. There are two types of redundancy: time and space. Time redundancy uses extra execution time, whereas space redundancy uses extra physical resources, such as extra processors, memory, disks, or communication links. A classic example of space redundancy is process replication. Fault tolerance requires some form of redundancy, and redundancy in turn requires some form of replication; therefore, providing fault tolerance requires replication. There are two well-known process replication schemes, active and passive, and both are fault tolerance techniques.


Techniques for fault tolerance

As mentioned above, there are two well-known techniques for fault tolerance in a distributed system: "active replication" and "passive replication".

Active replication
In active replication, every request is processed by all server replicas, so the technique consumes a lot of resources. In the context of Web services, this means it creates the need for redundant application servers. When the fault-tolerant system receives a request from a client, the request is sent to all of the server replicas. Each replica computes the result independently and sends it to a component called the "voter". The voter picks the result that occurs most often among the replicas, and that result is returned to the client. Figure 1 below is a pictorial representation of this process.

Figure 1. Active replication with voter
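The voting step described above can be illustrated with a minimal Python sketch. It is not taken from FAWS (which uses passive replication); the function names `vote` and `actively_replicated_call`, and the two toy replicas, are assumptions made for illustration only.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def vote(results: Sequence[Hashable]) -> Hashable:
    """Return the result that occurs most often among the replica results."""
    if not results:
        raise ValueError("no replica returned a result")
    winner, _count = Counter(results).most_common(1)[0]
    return winner


def actively_replicated_call(
    replicas: Sequence[Callable[[str], Hashable]], request: str
) -> Hashable:
    """Send the same request to every replica and let the voter pick the reply."""
    results = []
    for replica in replicas:
        try:
            results.append(replica(request))
        except Exception:
            # A crashed replica simply contributes no vote.
            continue
    return vote(results)


# Hypothetical replicas: two healthy ones and one that computes a wrong value.
def healthy(request: str) -> str:
    return request.upper()


def faulty(request: str) -> str:
    return "garbage"


reply = actively_replicated_call([healthy, faulty, healthy], "ping")
print(reply)
```

In this sketch the faulty replica is outvoted by the two healthy ones, which is why active replication needs enough replicas for a majority to exist.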
Passive replication -- Primary backup
The essential idea of the primary backup replication technique is that at any given instant one server acts as the primary server and does all the work. If the primary server fails, the backup server takes over. Passive replication comes in two variants: passive replication with state update and passive replication without state update. In the "with state update" scheme, the primary server keeps the backup server's state in step with its own. Figure 2 shows a simple primary backup system with state update.

Figure 2. Simple primary backup system with state update

As shown in Figure 2, the client sends a request to the primary server (1). The primary server does its work (2) and sends the resulting update to the backup server (3). If the backup server applies the update successfully (4), it sends an acknowledgement (5) to the primary server. After receiving the acknowledgement, the primary server will compare (6) and send a reply (7) to the client. If the primary server goes down, all requests are forwarded to the backup server. Because both servers maintain the same state, the backup needs no state update before it takes over.
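The numbered steps above can be sketched in a few lines of Python. This is a simplified, in-process illustration under stated assumptions: the class names `PrimaryServer` and `BackupServer`, the key/value "state", and the `send` helper are all hypothetical, and the compare step (6) is left out because the article does not describe it further.

```python
class BackupServer:
    def __init__(self):
        self.state = {}

    def apply_update(self, key, value):
        """Steps 3-5: store the primary's update and acknowledge it."""
        self.state[key] = value
        return True  # the acknowledgement

    def handle(self, key, value):
        """Used after failover, when the backup does the work itself."""
        self.state[key] = value
        return f"stored {key}={value}"


class PrimaryServer:
    def __init__(self, backup):
        self.state = {}
        self.backup = backup
        self.alive = True

    def handle(self, key, value):
        if not self.alive:
            raise ConnectionError("primary is down")
        self.state[key] = value                              # step 2: do the work
        acknowledged = self.backup.apply_update(key, value)  # steps 3-5
        if not acknowledged:
            raise RuntimeError("backup did not acknowledge the update")
        return f"stored {key}={value}"                       # step 7: reply


def send(primary, backup, key, value):
    """Forward to the primary; fall back to the backup if the primary is down."""
    try:
        return primary.handle(key, value)
    except ConnectionError:
        return backup.handle(key, value)


backup = BackupServer()
primary = PrimaryServer(backup)
send(primary, backup, "balance", 100)   # handled by the primary, mirrored to the backup
primary.alive = False                   # simulate a primary failure
reply = send(primary, backup, "balance", 80)
print(reply, backup.state)
```

Because every update was mirrored before the primary replied, the backup already holds the current state when it takes over.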

In the passive replication without state update scheme, the primary server does not update the backup frequently. The backup therefore has to do a warm start, unless the service is stateless. When the primary server goes down, the backup server takes over.


Why is a fault-tolerant system required for Web services?

Web services often suffer from long response times or temporary non-availability. For certain Web-based applications, such as online banking, stock trading, reservation processing, and shopping, such outages are unacceptable. Fault-tolerant techniques can address these problems; however, keep in mind that fault tolerance is a different concept from Reliable Messaging (RM) in Web services. In RM, both client and server are aware of the protocol and act upon it. In a fault-tolerant setup, the client might not know what is actually happening with the application server; at most, it might be aware that the server side has implemented fault-tolerant functionality.

Note:

On the client side, if you have RM support, such as Apache Sandesha, it handles message resending so that the client application does not need to resend messages itself. In that case, however, the client still has to be aware of what is happening on the server side; a fault-tolerant system on the server side makes the user's (client's) job easier.

Almost all Web service deployments are based on Simple Object Access Protocol (SOAP) messages (a few are based on REST (Representational State Transfer)), yet SOAP-based Web services lack a fault-tolerant capability. There are numerous fault-tolerant systems for real-time applications, but finding a fault-tolerant system for Web services is not an easy task. The major problems faced by clients who use SOAP to access a Web service are service non-availability and request resending. When a client sends a request and the server fails while the request is being processed, the client is disconnected and does not receive the expected response at all. Instead, it receives either a connection timeout exception or a socket closed exception. The client then has to resend the SOAP request to access the service. Such undesirable situations can be prevented by using a fault-tolerant system for SOAP-based Web services.


Client-transparent and non-transparent fault tolerance

In FAWS, there are two techniques called "client-transparent fault tolerance" and "non-transparent fault tolerance". In the case of client-transparent fault tolerance, the client sends the request and waits for the response -- the failover logic is handled on the server side, as shown in Figure 3. In the case of non-transparent fault tolerance, the client has a list of service providers and sends the request to the first one. If the client receives the desired response, then everything is fine. But if the client gets a SOAP fault or a transport-level fault (such as HTTP 500), it sends the request to the next service provider on the list, and so on until the desired response is obtained, as shown in Figure 4 below.


Figure 3. Client-transparent fault tolerance
  1. The client sends the request to the published EPR (endpoint reference) for a given service provider.
  2. The server at that EPR (which provides the FT functionality) forwards the request to one of its application servers.
  3. That application server returns a fault (either a SOAP fault or a transport fault).
  4. The same request is forwarded to another application server.
  5. That application server returns the desired output.
  6. The server with FT functionality then sends the response to the client.

Figure 4. Non-transparent fault tolerance
  1. The client sends a request to the first EPR on the list.
  2. The client gets an error from that EPR.
  3. The client sends the same request to a different EPR on the list.
  4. The client then gets the desired response.
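The non-transparent case can be sketched as a client-side loop over the provider list. The following Python example is only an illustration: `ServiceFault`, `call_with_failover`, the `fake_send` transport, and the example.com-style EPRs are hypothetical names, and a real client would plug in its own SOAP stack in place of `fake_send`.

```python
from typing import Callable, Sequence


class ServiceFault(Exception):
    """Stands in for a SOAP fault or a transport-level fault such as HTTP 500."""


def call_with_failover(
    eprs: Sequence[str], send: Callable[[str, str], str], request: str
) -> str:
    """Try each endpoint in order until one returns a response."""
    last_fault = None
    for epr in eprs:
        try:
            return send(epr, request)
        except ServiceFault as fault:
            last_fault = fault  # remember the fault and try the next EPR
    raise ServiceFault(f"all {len(eprs)} providers failed") from last_fault


# Hypothetical transport: only the third provider is healthy.
def fake_send(epr: str, request: str) -> str:
    if epr != "http://provider-c.example/service":
        raise ServiceFault(f"{epr} returned HTTP 500")
    return f"response to {request!r} from {epr}"


providers = [
    "http://provider-a.example/service",
    "http://provider-b.example/service",
    "http://provider-c.example/service",
]
answer = call_with_failover(providers, fake_send, "<getQuote/>")
print(answer)
```

Note that all of the retry logic lives in the client here; in the client-transparent case the same loop would run behind the published EPR instead.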

The architecture of FAWS

FAWS is based entirely on a distributed object architecture. It consists of several components that operate independently of each other in a distributed network. FAWS guarantees full availability of the whole system, even when a single component fails. This distributed architecture lowers the risk of a complete system failure. FAWS is based on passive replication without state update and provides client-transparent fault tolerance.

The front end of FAWS acts as the Web server from the client's point of view and receives the SOAP requests. When a request is received, it is forwarded to the primary Web server, and the response from that server is sent back to the respective client. The front end logs each received request before it is sent to the primary server. This frees clients from having to resend the request in case of a primary failure. As a result, clients only need to be aware of the FAWS front-end address to access the Web service (this could be the published EPR). They need not be concerned with the underlying primary and secondary servers.

Figure 5 illustrates the high-level component architecture of FAWS. The figure also highlights how the four major components of the system -- FT-Front, FT-Admin, FT-Detector, and FT-Monitor -- interact with each other, as well as with the primary and backup servers. Each component works independently and communicates with one or more other components in the system (for example, FT-Detector communicates with FT-Admin). All of the internal communication among the components is carried out using RMI (Remote Method Invocation).


Figure 5. The components architecture of FAWS

Functionality of the main components

FT-Front: This is the only component that directly interacts with clients. The client sees FT-Front as the application server that provides the Web services. FT-Front listens for SOAP requests on a specific port, which is configurable using FT-Admin. FT-Front uses RMI to communicate with FT-Admin. FT-Admin starts up FT-Front by providing configuration data, such as the IP address of the primary application server, the maximum number of resends per request (resending is possible because FT-Front maintains a message log), and so forth. FT-Front can fail over to a new primary server dynamically when it is notified by FT-Admin.

FT-Admin: This component can be considered the core of the system. It consists of two subcomponents: the administration subcomponent and the monitoring subcomponent (FT-Monitor). FT-Admin manages the fault-tolerant system as a single system and its applications as if they were running on a single server. FT-Admin communicates constantly with FT-Detector and FT-Front in order to provide uninterrupted service. FT-Admin provides two services: replication management and configuration management. Replication management is responsible for maintaining the replicated servers and for failover operations, while configuration management is responsible for system initialization and configuration changes.

FT-Detector: This is the fault-detecting component of the system. It detects software failures (process failures) and hardware failures (machine failures) and notifies FT-Admin accordingly. Software faults are detected by port scanning: by checking whether a particular port is active, FT-Detector can decide whether the application server listening on it is up and running. FT-Detector uses this technique to track Web server failures. Hardware or network failures are detected using the Internet Control Message Protocol (ICMP). ICMP echo requests are sent to each machine periodically. FT-Detector waits for a certain time period to receive a reply -- the checking period can be set using FT-Monitor, and the default interval is one second. If no reply arrives, it resends the ICMP request and waits again. If it does not get a reply to the resent request either, FT-Detector decides that the machine has failed.
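The two detection strategies can be sketched in Python as follows. This is an illustration under assumptions rather than the FT-Detector code: `port_is_open` and `machine_has_failed` are hypothetical names, and because sending real ICMP echo requests usually needs raw sockets and elevated privileges, the echo probe is passed in as a function that reports whether a reply arrived within the waiting period.

```python
import socket
from typing import Callable


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Process-failure check: can a TCP connection to the server port be opened?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat the server process as down.
        return False


def machine_has_failed(probe: Callable[[], bool], attempts: int = 2) -> bool:
    """Machine-failure check: declare failure only if every probe goes unanswered.

    `probe` sends one echo request and returns True if a reply arrived in time.
    With attempts=2 this mirrors "send, wait, resend, wait" from the text.
    """
    for _ in range(attempts):
        if probe():
            return False
    return True


# Simulated probes: the first echo is lost but the resent one is answered.
replies = iter([False, True])
recovered = machine_has_failed(lambda: next(replies))

# Simulated probes: neither the original nor the resent echo is answered.
dead = machine_has_failed(lambda: False)
print(recovered, dead)
```

Resending once before declaring a failure trades a slightly slower detection time for fewer false alarms caused by a single lost packet.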

FT-Monitor: FT-Admin and FT-Monitor run as one component. FT-Monitor is the only component that gives a graphical representation of the system, such as how the components are interconnected and their status. It provides two major functions for the system administrator. First, it provides a way to set (or change) the system configuration at the initial stage and at run-time. Second, it shows the system status and a graphical representation of the distributed FAWS components.


Message processing flow inside FAWS

FAWS is capable of handling multiple clients simultaneously. The number of clients it can serve depends on the size of the incoming queue -- even though FAWS is thread-safe, each accepted request has to be processed to completion. To keep the example simple, I will only discuss how FAWS operates with one client (see Figure 6).


Figure 6. Message processing steps of FAWS

Figure 6 illustrates communication and message exchanges among FAWS components under both normal and faulty conditions. The numbers in the figure correspond to the steps in the following tables, which describe each step in detail. Table 1 shows FAWS operations under normal (faultless) conditions. Table 2 and Table 3 illustrate FAWS operations under faulty conditions.

Table 1. FAWS operations under faultless conditions
Step  Description
1     The client sends the SOAP request to FT-Front.
2     FT-Front accepts the client and creates a separate thread named client-thread to handle the client. This starts the client-thread.
3     The client-thread running in FT-Front logs the message.
4     The client-thread gets the IP address of the current primary server and forwards the request to it.
5     The primary server processes the request and sends back the response to the respective client-thread in FT-Front.
6     The client-thread sends the received response from the server to the client.
7     The client-thread removes the logged SOAP request and terminates.

If any fault or failure is detected by FAWS (FT-Detector), its operation deviates a little from above. Steps 8 to 13 describe the operation of FAWS under such a faulty condition.

Table 2. FAWS under faulty conditions
Step  Description
8     FT-Detector detects a failure (either machine failure or Web server process failure) in primary.
9     FT-Detector notifies FT-Admin about the failure.
10    FT-Admin selects one of the backup servers as the primary and notifies FT-Monitor to show the current status.
11    FT-Admin notifies FT-Front about the new primary server. FT-Front changes its current primary address to the new address.
12    When the primary is changed by FT-Admin, FT-Front discards the requests sent to the failed primary and accesses the message log to get the logged messages, which have been sent to the previous primary.
13    FT-Front sends the acquired logged requests to the new primary. After that, the system operation is similar to the operation described in steps 5, 6 and 7, respectively.

FT-Front is also capable of detecting primary server failures. This fault-detection mechanism in FT-Front is useful in case of an FT-Detector failure. Steps 14 and 15 describe what happens if FT-Front gets an exception while trying to access the primary Web server.

Table 3. FAWS under faulty conditions (failure detected by FT-Front)
Step  Description
14    Because the primary server has failed, FT-Front gets an exception while sending a received request.
15    FT-Front requests a new primary address from FT-Admin. This failure-detection mechanism in FT-Front guarantees the full operation of FAWS in case of an FT-Detector failure. After this step, the system operation is similar to the operation described in steps 10, 11, 12, and 13, respectively.
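The log-forward-replay behavior in the tables above can be condensed into a small single-client Python sketch. It is an assumption-laden illustration, not the FAWS implementation: the `FTFront` class, the `ask_admin_for_primary` callback (standing in for the RMI call to FT-Admin), and the toy servers are hypothetical, and threading is left out.

```python
import itertools


class FTFront:
    def __init__(self, primary, ask_admin_for_primary, max_resends=3):
        self.primary = primary                      # callable: request -> response
        self.ask_admin_for_primary = ask_admin_for_primary
        self.max_resends = max_resends
        self.log = {}                               # message log: id -> request
        self._ids = itertools.count(1)

    def handle(self, request):
        msg_id = next(self._ids)
        self.log[msg_id] = request                  # step 3: log before forwarding
        for _ in range(self.max_resends + 1):
            try:
                # Steps 4-5 (or 13 on a resend): forward the logged request.
                response = self.primary(self.log[msg_id])
            except ConnectionError:                 # step 14: primary unreachable
                # Step 15, then 11: obtain and switch to the new primary.
                self.primary = self.ask_admin_for_primary()
                continue                            # step 12: go back to the log
            del self.log[msg_id]                    # step 7: remove the log entry
            return response                         # step 6: reply to the client
        raise ConnectionError("no server could process the request")


# Hypothetical servers: the primary has failed, the backup is healthy.
def failed_primary(request):
    raise ConnectionError("primary is down")


def backup_server(request):
    return f"<response to='{request}'/>"


front = FTFront(primary=failed_primary, ask_admin_for_primary=lambda: backup_server)
response = front.handle("getQuote")
print(response, front.log)
```

From the client's point of view there is a single call to `handle`; the failed attempt and the resend to the new primary stay hidden behind the front end.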

Conclusion

Web services have become a major middleware platform, but they lack built-in support for fault tolerance. FAWS addresses this gap by keeping Web services available in the presence of failures. Because FAWS provides client-transparent fault tolerance, the client is not aware of any server failure and need not resend the request.



Download

Description      Name      Size    Download method
FAWS components  FAWS.zip  1014KB  FTP | HTTP | Download Director
