Error handling in WebSphere Process Server, Part 1: Developing an error handling strategy

With the emergence of service-oriented solutions, we've seen a sharp rise in developer productivity. Developers are empowered with a newfound freedom of service construction and reuse. However, with this freedom comes an increased exposure to inconsistent service definitions. These inconsistencies expose weaknesses in error handling and system recovery across the solution. Along with the proper governance controls, IT organizations need to define and enforce error handling strategies tailored for solution recovery. Part 1 of this article series introduces the topic of error handling strategies and highlights key concepts and objectives for developing a strategy.

Sunita Chacko, Managing Consultant, IBM Toronto

Sunita Chacko is a Managing Consultant with Software Services for WebSphere at the IBM Toronto Lab. Sunita divides her time between providing technical expertise to WebSphere Process Server customers and field expertise to the WebSphere development teams. You can reach Sunita at: schacko@ca.ibm.com.



Jeff Brent, Senior Software Engineer, IBM

Jeff Brent is a Senior Software Engineer for WebSphere Process Server development in West Palm Beach, Florida. Jeff is the lead for the WebSphere Process Server SWAT team and spends much of his time working with customers resolving deep technical issues. In his free time, he enjoys spending time with his family at the beach and playing basketball. You can reach Jeff at: jeffb@us.ibm.com.



29 October 2008

Introduction

Developers have always used strategies and standards to organize development efforts, reduce complexity, and promote consistency in the creation of artifacts and deliverables.

  • We form standards bodies.
  • We write and follow open source specifications.

This article provides an introduction to "What is an Error Handling Strategy?" and highlights key concepts and objectives for developing a strategy and the subsequent project standards.

Subsequent articles in the series will provide:

  • A working example of usage of Transaction qualifiers to define Unit of Work scopes.
  • Detailed information with respect to the invocation patterns that are used by the various component types and how to control the usage of asynchronous and synchronous invocation protocols.
  • An example of an error handling framework that will leverage common product capabilities such as human tasks to establish an error processing solution.

Prerequisites for this series

This series requires that you install the following products:

  • IBM® WebSphere® Process Server V6.1.2
  • WebSphere Enterprise Service Bus V6.1.2

Why do I need an error handling strategy?

Error handling and prevention is the most often overlooked non-functional requirement. A strategy is needed to standardize development practices. Every development or definition language provides tools for error handling and prevention.

  • Java™ has try/catch blocks, log4j, and assorted frameworks.
  • WSDL provides capability for defining faults and declared exceptions.

Unfortunately, more often than not, projects are not using the tools for error handling and prevention effectively. The lack of standardization leads to instability and unnecessary gaps in application integrity. This is especially common when production "projects" grow from proof of technology "pilots".

These strategies must be developed up front by the solution and enterprise architecture teams so that the greater development organization will be positioned to achieve the number one design goal: Consistent and successful recovery of business transactions and data.

With consistency comes predictability. These two qualities are vitally important for error prevention and system recovery.

What is an error handling strategy?

An error handling strategy is developed by understanding the business needs and applying common patterns of development to manage application and infrastructure volatility. By using these patterns, integration developers are able to quickly assemble consistent applications that provide and consume services from common end points.

The strategy establishes a recovery framework, and standards enforce the usage of that framework.

Where do I begin?

In much the same way we collect non-functional requirements for performance, we need to define and record the objectives and service level agreements for error handling and recovery. This is the most important step of the process. There are requirements that must be gathered from the Information Technology and Operations teams about how the system should behave. They require systems that are consistent and transactionally sound. This aids in the development of operations manuals and procedures.

An error handling strategy should address three important objectives:

  1. Recognize
  2. Record
  3. React

Recognition

Recognizing that there is a problem is the first line of defense. If everything always followed the expected "happy path", there would be no need for an error handling strategy and standard. In practice, there are bound to be system failures and unexpected phenomena. Applications should be structured in such a way that exceptions are properly classified. This is the key to the recognition stage.

Exception types

There are different types of exceptions. Determining which are expected or unexpected will further enhance the ability to properly disposition the event. Declared exceptions should be handled in an automated fashion where possible to alleviate the strain on the system administration and operations teams. The following are examples of expected exceptions:

  • Debit exceeds the available amount
  • Invalid product code

Operations teams are typically involved in the correction of unexpected conditions such as "Database or System unavailable".
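The distinction between expected and unexpected exceptions can be sketched in plain Java. This is a minimal illustration, not a product API: the class names `ExpectedBusinessException` and `InvalidProductCodeException`, and the `Disposition` values, are hypothetical names chosen for this example.

```java
/** Hypothetical sketch: choose a disposition for an error based on its type. */
class ExceptionDisposition {

    enum Disposition { HANDLE_AUTOMATICALLY, NOTIFY_OPERATIONS }

    /** Hypothetical base type for declared (expected) business errors. */
    static class ExpectedBusinessException extends Exception {
        ExpectedBusinessException(String message) {
            super(message);
        }
    }

    /** Expected error: the caller supplied a product code that does not exist. */
    static class InvalidProductCodeException extends ExpectedBusinessException {
        InvalidProductCodeException(String code) {
            super("Invalid product code: " + code);
        }
    }

    /** Declared business errors are handled automatically; anything else goes to operations. */
    static Disposition dispositionOf(Throwable error) {
        if (error instanceof ExpectedBusinessException) {
            return Disposition.HANDLE_AUTOMATICALLY;
        }
        return Disposition.NOTIFY_OPERATIONS;
    }

    public static void main(String[] args) {
        System.out.println(dispositionOf(new InvalidProductCodeException("X-1")));
        System.out.println(dispositionOf(new IllegalStateException("Database unavailable")));
    }
}
```

Keeping the decision in one place means every component dispositions the same error type the same way, which is the consistency the strategy is meant to enforce.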

Techniques for recognizing errors

After determining which types of errors can be expected to occur (and defining how these will be represented), there are a variety of tools and techniques available to developers for identifying errors and classifying them within their application. An error handling strategy should define which of these methods to use and when to use them, based on the application requirements, design, and architecture.

Fault handlers in BPEL

When using BPEL to choreograph service invocations, we use fault handlers to define how a business process should behave when specified faults occur. Put another way, fault handlers are the tool business process developers use to define which exceptions the process should react to, and how.

Fault handlers may be configured to catch faults defined on the process component's reference interface(s), or one of the built-in faults. The Business Process engine detects when a declared or built-in fault is raised at runtime, and triggers the appropriate fault handler, if one exists. There is also an option to create a "Catch All" block, to be triggered if any unhandled error occurs, but it is recommended that developers use this with caution in a well-defined error handling approach. In general, it is a better practice to use a more specific fault handler. For example, if the goal is to catch all ServiceRuntimeExceptions, a fault handler for the built-in "runtimeFault" may be used instead.

Figure 1. User-defined and built-in faults

WebSphere Process Server's process engine defines a number of built-in faults. These faults should be understood and used consistently across the development effort.

Additionally, there are a number of samples and examples available for how to set up and define a fault handling routine: on the BPM samples and tutorials Web site, under Advanced BPEL features > Fault Handling.

Once the team is familiar with the capabilities of the product, it can decide, based on the error handling requirements and design, when, where, and why to place fault handlers at specific scopes. This then becomes the standard, and the vast majority of the BPEL processes will behave consistently in adverse circumstances.

Failure nodes and unmodeled faults in WebSphere Enterprise Service Bus

The mediation tooling in WebSphere Enterprise Service Bus (hereafter known as WESB) provides features for handling both declared and undeclared faults.

If the interface of the target service (to which you are mediating) has a fault declared, a Callout Fault node will be automatically added to the response flow of your mediation. To define how your mediation behaves when this fault occurs, simply create a wire from the callout fault to the appropriate node in your flow.

To handle unmodeled or undeclared faults in either a request or response flow, create wires from the fail terminals of the node where the error may occur. Fail terminals are the rectangular terminals found on the right of each node. They are triggered whenever an unexpected error occurs during the execution of the node.

Figure 2. Callout Fault and Fail Terminals in a Response Flow

For an example of using fail terminals, please see the following article: Handling unmodeled faults within WebSphere Process Server V6.1

Using Java

Whether you are creating Java components or are using Java snippets in BPEL or Interface maps, a standardized technique for handling errors should be defined and documented. Java development efforts should leverage the provided "try/catch" capability of the chosen implementation. Standardizing the error handling requirements for custom development projects will greatly increase the productivity of all subsequent problem determination exercises.

Java components

Java components use the same try/catch keywords provided by the Java language. WebSphere Process Server's programming model provides an implementation of declared and undeclared exceptions.

ServiceRuntimeExceptions
Undeclared and unexpected error types. Typically, these are created only by the runtime.
ServiceBusinessExceptions
Declared and expected error types. These are defined on the service interface and programmatically created.

Examples of exception handling for Java components can be found in the following article: Exception handling in WebSphere Process Server and WebSphere Enterprise Service Bus

Regardless of the "catch" capability or the component type used, fundamentally the recognition phase should contain at least the following basic principles for custom Java projects:

Capturing
Using the appropriate capability to capture the state of the system as close to the root cause as possible.
Classifying
Organizing the errors into groups such as expected or unexpected.
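The two principles above can be sketched together in plain Java. This is an illustration under stated assumptions, not the product's programming model: `BusinessFault` and `SystemFault` are hypothetical stand-ins for declared and undeclared error types, and `lookupStock` is a made-up back-end call.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of capturing state near the root cause and classifying the error. */
class CaptureAndClassify {

    /** Hypothetical stand-in for a declared, expected error. */
    static class BusinessFault extends RuntimeException {
        BusinessFault(String message) {
            super(message);
        }
    }

    /** Hypothetical stand-in for an undeclared, unexpected error; carries captured context. */
    static class SystemFault extends RuntimeException {
        final Map<String, String> context;

        SystemFault(String message, Throwable cause, Map<String, String> context) {
            super(message, cause);
            this.context = context;
        }
    }

    /** Made-up back-end call that fails for particular inputs. */
    static int lookupStock(String productCode) {
        if (productCode.isEmpty()) {
            throw new BusinessFault("Invalid product code");
        }
        if (productCode.equals("DOWN")) {
            throw new IllegalStateException("connection refused");
        }
        return 42;
    }

    static int checkStock(String orderId, String productCode) {
        try {
            return lookupStock(productCode);
        } catch (BusinessFault expected) {
            // Classifying: expected errors pass through unchanged for automated handling.
            throw expected;
        } catch (RuntimeException unexpected) {
            // Capturing: record the inputs at the point of failure, keeping the original cause.
            Map<String, String> context = new LinkedHashMap<>();
            context.put("orderId", orderId);
            context.put("productCode", productCode);
            throw new SystemFault("Stock lookup failed", unexpected, context);
        }
    }

    public static void main(String[] args) {
        System.out.println(checkStock("order-1", "ABC"));
    }
}
```

Wrapping the unexpected error where it occurs preserves both the original cause and the business identifiers that later phases need for problem determination.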

Record

After the appropriate recognition and classification of a problem or exceptions, it is critical to record data about the state of the system for future recovery. This data is essential to the problem determination effort that will follow in the React phase of the strategy.

Like the recognition phase, the record phase involves an important classification effort. Our error handling strategy should consider the differences between recording business and IT events. Although the two are similar, it would not be accurate to assume that all IT events are undeclared and all business events are declared. The mechanism that we use to record each kind of incident will likely be different.

IT events and logging

Generally, IT events are recorded to a log and/or emitted for IT monitoring products like Tivoli. Since debugging and other problem determination activities may require a precise and detailed record of what has happened, the volume of detail we choose to record is better suited to logs and collectors. The project should set up a common logging framework and standard. The logging framework can be implemented using configurable tools such as java.util.logging (JSR47 logging).

The standard defined by your local team should define the frequency and depth of each log entry. There are many things to consider:

  • Where are trace entries standard?
  • Method entry and exits?
  • How do we standardize the usage of message severity?

The strategy should define the standard for logging/trace and enforce the usage of the standard with peer or automated code reviews.
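As one possible shape for such a standard, the following sketch uses java.util.logging to trace method entry and exit and to log a rejected input at WARNING severity. The class name `OrderValidator`, the product code format, and the choice of severity are assumptions made for illustration only.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical component that follows a simple entry/exit and severity convention. */
class OrderValidator {
    private static final String CLASS_NAME = OrderValidator.class.getName();
    private static final Logger LOGGER = Logger.getLogger(CLASS_NAME);

    static boolean isValidProductCode(String code) {
        // Method entry is traced at FINER, so it costs little unless trace is enabled.
        LOGGER.entering(CLASS_NAME, "isValidProductCode", code);

        // Assumed format for illustration: three capital letters, a dash, four digits.
        boolean valid = code != null && code.matches("[A-Z]{3}-\\d{4}");
        if (!valid) {
            // Assumed convention: rejected input is a WARNING, not a SEVERE system failure.
            LOGGER.log(Level.WARNING, "Rejected product code: {0}", code);
        }

        // Method exit is traced at FINER together with the result.
        LOGGER.exiting(CLASS_NAME, "isValidProductCode", valid);
        return valid;
    }

    public static void main(String[] args) {
        System.out.println(isValidProductCode("ABC-1234"));
    }
}
```

A convention like this answers the questions above in code: entry and exit are always traced at the same level, and severity is chosen by the kind of condition rather than by each developer's preference.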

Figure 3. JSR47 Logging using Visual Snippets

Business events and CEI

Business events can be emitted in a WebSphere environment using the Common Event Infrastructure (CEI). Much like logging IT events, it is important to establish a standard for when, where, and why these business events are emitted. Emitting too much or too little will have different but adverse effects. For example, if the entire event digest is included in each event and a given event is logged a number of times, the additional load on the system can be significant enough that the scenario must be considered during load and capacity planning. If too few events are emitted, the value of the data will suffer.


React

Finally, an error handling strategy needs to put the proper provisions in place to allow the IT and business organizations to react to the adverse condition. There are three ways a system can react to errors:

  1. Automated recovery
  2. Trigger manual corrective procedures (by a person such as an administrator)
  3. Switch to an alternate processing path

The timely and appropriate reaction to a reported condition will tremendously improve the resiliency of the deployed solution. This will in turn increase satisfaction with the provided business services as well as reduce the total cost of ownership of the solution.

Automated recovery

Automated recovery within the "React Phase" of the error handling strategy implies that where possible, the application should automatically ensure that the system stays in a consistent and predictable state.

Transactions

Transaction scopes can be configured and used to maintain the consistent state of resources during the course of an unexpected condition. XA Transactions may be used to create scopes around capable resources. A practical example of the configuration necessary to create and manage transaction scopes will be covered in subsequent articles.

BPEL compensation

Compensation of microflows and long-running processes can be used to "undo" the outcome of service invocations that have already completed. It is used when choreographing non-transactional services. (If all the services were transactional, you could have them participate in a single transaction.)

In long-running processes, compensation of activities that have successfully executed is initially triggered by a fault raised in the process, or can be explicitly triggered using a compensation activity. This is a useful technique for reversing the effects of already-committed transactions within a long-running process.

Figure 4. Compensation handlers and activities

Compensation in microflows is slightly different. If an error occurs during execution of the microflow, the microflow's transaction is marked for rollback; the entire microflow and all the transactional services it invokes will be rolled back. If the process interacts with non-transactional services, this can lead to an inconsistent state across the system. Instead, the business process developer should define compensation services for the activities that invoke non-transactional services. These will be executed during microflow compensation (when the microflow's transaction is rolled back).
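The idea behind compensation can be illustrated outside of BPEL with a small Java sketch. In the product, compensation is configured on the process in the tooling; the code below is only a conceptual model, and the `Step` interface, the `run` method, and the step names are all hypothetical. Each completed step is remembered, and when a later step fails, the completed steps are undone in reverse order.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Conceptual model of compensation; not how the process engine is implemented. */
class CompensationSketch {

    /** A unit of work that also knows how to undo its own effect. */
    interface Step {
        void invoke();
        void compensate();
    }

    /** Runs steps in order; on a failure, compensates the completed steps in reverse order. */
    static boolean run(List<Step> steps) {
        Deque<Step> completed = new ArrayDeque<>();
        try {
            for (Step step : steps) {
                step.invoke();
                completed.push(step); // the most recently completed step sits on top
            }
            return true;
        } catch (RuntimeException fault) {
            while (!completed.isEmpty()) {
                completed.pop().compensate();
            }
            return false;
        }
    }

    /** Builds a step that writes to a journal, or fails before doing any work. */
    static Step step(String name, boolean fails, List<String> journal) {
        return new Step() {
            public void invoke() {
                if (fails) {
                    throw new IllegalStateException(name + " failed");
                }
                journal.add("do " + name);
            }

            public void compensate() {
                journal.add("undo " + name);
            }
        };
    }

    public static void main(String[] args) {
        List<String> journal = new ArrayList<>();
        run(Arrays.asList(
                step("reserve", false, journal),
                step("charge", false, journal),
                step("ship", true, journal)));
        System.out.println(journal);
    }
}
```

The failed step itself is not compensated because it never completed; only work that already took effect needs to be reversed.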

Figure 5. Compensation in microflows


Automatic retries

Additionally, there may be cases where automatic retries can be used to increase the resilience of the application to intermittent errors. WebSphere Process Server and WebSphere Enterprise Service Bus provide a retry capability automatically under some circumstances, such as when making asynchronous invocations. Identifying when and where retries will be provided "for free" will help the architecture team determine which implementation patterns and invocation styles should be used. The topic of retries will be addressed in more detail in subsequent articles.
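Where the product does not supply a retry, a team might standardize on an application-level helper. The following is a minimal sketch of that idea and not the product's built-in retry mechanism; `TransientFailure`, `callWithRetry`, and the fixed delay are hypothetical choices for this example. Only errors classified as intermittent are retried, and anything else propagates immediately.

```java
import java.util.concurrent.Callable;

/** Hypothetical application-level retry helper for intermittent errors. */
class RetrySketch {

    /** Hypothetical marker for errors that are worth retrying. */
    static class TransientFailure extends Exception {
        TransientFailure(String message) {
            super(message);
        }
    }

    /**
     * Calls the operation up to maxAttempts times, pausing between attempts.
     * Only TransientFailure is retried; any other exception propagates at once.
     */
    static <T> T callWithRetry(Callable<T> call, int maxAttempts, long delayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        TransientFailure last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (TransientFailure e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis);
                }
            }
        }
        // All attempts failed with a transient error; surface the last one.
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new TransientFailure("service busy");
            }
            return "ok";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

Bounding the number of attempts matters: once retries are exhausted, the error should move on to the manual correction path rather than loop indefinitely.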

Manual correction

There will always be unforeseen errors, or errors that the system cannot handle automatically. In these cases, the application developer may choose to have the application "hold" and wait, to allow an administrator (or other human) to manually diagnose and correct the error.

BPEL "Continue on Error" property

On BPEL processes and activities, a developer may set the "continue on error" property to "false". If the activity fails at runtime, the Business Process engine will transition the activity to the "stopped" state, and the process will not proceed to the next activity. The BPC explorer will show the process with the stopped activity under the Critical Processes view, and in some cases, the reason for failure. Stopped activities may be resubmitted or terminated by the process administrator.

Figure 6. Continue on Error, Process-level and Activity-level Setting

Failed events

Failed events are created automatically by the WebSphere Process Server product. The architecture team should understand exactly when failed events are created and establish the appropriate development standards to leverage this product capability. These failed events will be stored in the system until they are dispositioned by an administrator or programmatically by the Failed Event Manager APIs. These APIs are documented in the web/ directory of your WPS test server installation: <WID-install-dir>/runtimes/bi_v61/web/mbeanDocs/FailedEventManagerMBean.html

For additional information about the Failed Event Manager's capabilities, refer to the Failed Event Manager documentation in the information center for your specific product version.

Failed Event Manager Documentation for WebSphere Process Server 6.1

Notification

In the case of manual intervention, administrators or the staff responsible for the manual activity need to be notified. The error handling strategy needs to document how the notifications are sent, how frequently they are sent, and who (which role) is responsible for acting upon the event. As described previously, the notification mechanism will likely differ based on whether the error condition is an IT-level error or a business-related problem.

Switch to an alternate processing path

Sometimes, the business process definition accounts for error conditions. These are truly business-level errors, rather than IT level errors, and should be handled accordingly. Specifically, the logic for handling these error conditions is part of the modeled business process flow, rather than one of the techniques described in earlier sections. For example, in a human-centric workflow, all exception cases may be routed to a senior employee for processing. Or, in a business process defined using BPEL, you can take advantage of the "Case" and "Otherwise" clauses of the "Choice" construct to model If-Then-Else conditions.


Evaluate and establish a system framework

These service level agreements will have a direct impact on the types of error handling frameworks that are required. After we have established the requirements for the framework, we can then begin to evaluate the provided product capabilities that we can leverage for error handling and recovery. You can use and combine each of the following product capabilities to establish a framework within the strategy:

  • Business processes
  • Human tasks
  • Failed Event Manager
  • Common Event Infrastructure
  • Business Activity Monitoring

The capability of the frameworks will define the operational procedures needed in the event of system loss or instability. It is the developer's job to create applications that leverage the framework and the chosen product capability.

The scope of automated recovery is as broad as the imagination. There are many capabilities and tools that can be created with the product provided. After all, it is a process automation, integration, and BPM product. Published APIs and interfaces to the Failed Event Manager and the Business Process Container open up many possibilities for automating exception processing. The extent of the automated recovery should be discussed and defined before starting the project. As previously stated, the artifacts that are being developed will need to be created with a common theme.


Evaluate services and service providers

The selected framework could be affected by characteristics of the service such as:

  • Batch windows and system availability
  • Lack of transaction support
  • Message tolerance/Assured Delivery: At least once, once and only once
  • Reliability of the protocol

As you can see, these factors are usually directly related to the endpoints that are used in the solution and their capabilities. When developing the error handling strategy, you should evaluate each endpoint by a number of characteristics. Then the architecture team can develop common patterns and templates for dealing with the common sets of issues. Along the way, a standard (that is, always handle faults) is created. Adhering to the standard will increase the productivity of the development team and simplify process definitions as "all endpoints behave the same".
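As an example of a common pattern for one of these characteristics, consider an endpoint that offers "at least once" delivery, where the same message may arrive more than once. A receiver can be written to tolerate duplicates. The sketch below is one way to do that and rests on assumptions: the class `DuplicateTolerantReceiver` is hypothetical, each message is assumed to carry a unique ID, and the in-memory set of processed IDs would need to be a durable store in a real system.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical receiver that applies each message at most once, keyed by message ID. */
class DuplicateTolerantReceiver {
    // Assumption: an in-memory set is enough for illustration; production needs durable storage.
    private final Set<String> processedIds = new HashSet<>();
    private int balance = 0;

    /** Applies a credit once per message ID; returns false for a redelivered duplicate. */
    synchronized boolean onMessage(String messageId, int amount) {
        if (!processedIds.add(messageId)) {
            return false; // already processed, so ignore the redelivery
        }
        balance += amount;
        return true;
    }

    synchronized int balance() {
        return balance;
    }

    public static void main(String[] args) {
        DuplicateTolerantReceiver receiver = new DuplicateTolerantReceiver();
        receiver.onMessage("msg-1", 100);
        receiver.onMessage("msg-1", 100); // redelivery of the same message
        System.out.println(receiver.balance());
    }
}
```

With a pattern like this in the project's templates, a redelivered message no longer needs special-case error handling in each process that consumes the endpoint.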

Operation types

Fundamentally, each service interface can have any number of operations. However, one-way operations have different requirements than request response operations. Develop a standard for how to call and manage exceptions from the different types.

Bindings

Web service imports provide different qualities of service than JMS imports. Therefore each import binding type used within the solution should have a standard interaction. The standard interaction should be developed to normalize the behaviors across the bindings to promote the "all endpoints behave the same" philosophy.

Invocation patterns and transactions

Messaging endpoints require asynchronous invocation. The client has to create and commit a message before the service provider can provide an asynchronous response. This pattern of behavior should be standardized and made consistent. Additionally, all invocation pattern decisions should be made with a clear understanding of their benefits and impacts. A poorly designed application that uses an excessive amount of asynchronous interaction, whether intentionally or by mistake, can substantially increase the necessary amount of error handling logic and significantly reduce the throughput of the system.

Availability

Some endpoints are not always available. Whether the outage is planned or unplanned, the team should create a standard strategy for all endpoints to represent a contract for the conditions where the service is unavailable. The error handling strategy should anticipate this fact and provide a standard that promotes quick and easy problem determination and resolution. The standardization of this information also simplifies the creation of the project's operation manual.

Security

There is a wide range of security options. This article series cannot address the vast security domain. However, we strongly recommend approaching security in as consistent a manner as possible. Additionally, security-related errors should be anticipated and addressed. Again, this will expedite problem resolution and standardize operational procedures.

Performance and throughput

Non-functional performance requirements can have a profound impact on the planning for error handling. Recovery requirements for batch style processing would be different than service providers that are used for synchronous queries. There are several questions that must be answered with respect to performance and throughput.


Bringing it all together

The total solution will be unique to the circumstances, requirements, and preferences of the local team. However, there are design patterns that you can use with every BPM project. This article is intended to start the conversation about error handling strategies and the concepts associated with them. In subsequent articles, we will define specific patterns of error handling that can be established as contributions to your local project's standards.

Conclusion

Developing an error handling strategy is a very important and often overlooked design activity. The strategy should precede the design and implementation of business requirements as the strategy will establish standards that the implementation must follow. Each error handling strategy is unique to the given project circumstances. However, you can borrow and customize common patterns to meet your specific needs and service level agreements.
