Developers have always used strategies and standards to organize development efforts, reduce complexity, and promote consistency in the creation of artifacts and deliverables.
- We form standards bodies.
- We write and follow open source specifications.
This article provides an introduction to "What is an Error Handling Strategy?" and highlights key concepts and objectives for developing a strategy and the subsequent project standards.
Subsequent articles in the series will provide:
- A working example of using Transaction qualifiers to define Unit of Work scopes.
- Detailed information with respect to the invocation patterns that are used by the various component types and how to control the usage of asynchronous and synchronous invocation protocols.
- An example of an error handling framework that will leverage common product capabilities such as human tasks to establish an error processing solution.
Prerequisites for this series
This series requires that you install the following products:
- IBM® WebSphere® Process Server V6.1.2
- WebSphere Enterprise Service Bus V6.1.2
Why do I need an error handling strategy?
Error handling and prevention is the most often overlooked non-functional requirement. A strategy is needed to standardize development practices. Every development or definition language provides tools for error handling and prevention.
- Java™ has try/catch blocks, log4j, and assorted frameworks.
- WSDL provides capability for defining faults and declared exceptions.
Unfortunately, more often than not, projects are not using the tools for error handling and prevention effectively. The lack of standardization leads to instability and unnecessary gaps in application integrity. This is especially common when production "projects" grow from proof of technology "pilots".
These strategies must be developed up front by the solution and enterprise architecture teams so that the greater development organization will be positioned to achieve the number one design goal: Consistent and successful recovery of business transactions and data.
With consistency comes predictability. These two qualities are vitally important for error prevention and system recovery.
What is an error handling strategy?
An error handling strategy is developed by understanding the business needs and applying common patterns of development to manage application and infrastructure volatility. By using these patterns, integration developers are able to quickly assemble consistent applications that provide and consume services from common end points.
The strategy establishes a recovery framework and standards enforce the usage of the framework.
Where do I begin?
In much the same way we collect non-functional requirements for performance, we need to define and record the objectives and service level agreements for error handling and recovery. This is the most important step of the process. There are requirements that must be gathered from the Information Technology and Operations teams about how the system should behave. They require systems that are consistent and transactionally sound. This aids in the development of operations manuals and procedures.
An error handling strategy should address three important objectives:
- Recognize
- Record
- React
Recognizing that there is a problem is the first line of defense. If everything always behaved as expected (the "happy path"), there would be no need for an error handling strategy and standard. The fact is that there are bound to be system failures and unexpected phenomena. Applications should be structured in such a way that exceptions are properly classified. This is the key to the recognition stage.
There are different types of exceptions. Determining which are expected or unexpected will further enhance the ability to properly disposition the event. Declared exceptions should be handled in an automated fashion where possible to alleviate the strain on the system administration and operations teams. The following are examples of expected exceptions:
- Debt exceeds the available amount
- Invalid product code
Operations teams are typically involved in the correction of unexpected conditions such as "Database or System unavailable".
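The distinction between expected and unexpected exceptions can be sketched in plain Java. This is a minimal illustration, not product code: the exception classes, the `OrderService` class, the product code format, and the disposition strings are all hypothetical names chosen to mirror the examples above.

```java
// Hypothetical declared (expected) business exceptions, modeled as checked exceptions.
class DebtExceedsAvailableException extends Exception {
    DebtExceedsAvailableException(String message) { super(message); }
}

class InvalidProductCodeException extends Exception {
    InvalidProductCodeException(String message) { super(message); }
}

class OrderService {
    // Hypothetical business operation that declares its expected failures.
    static void placeOrder(String productCode, long amountCents, long availableCents)
            throws DebtExceedsAvailableException, InvalidProductCodeException {
        // Assumed product code format, for illustration only.
        if (productCode == null || !productCode.matches("[A-Z]{3}-\\d{4}")) {
            throw new InvalidProductCodeException("Invalid product code: " + productCode);
        }
        if (amountCents > availableCents) {
            throw new DebtExceedsAvailableException(
                    "Requested " + amountCents + " exceeds available " + availableCents);
        }
    }

    // Returns a disposition string describing how the request was handled.
    static String submit(String productCode, long amountCents, long availableCents) {
        try {
            placeOrder(productCode, amountCents, availableCents);
            return "ACCEPTED";
        } catch (DebtExceedsAvailableException | InvalidProductCodeException e) {
            // Expected condition: handled automatically, no operator involved.
            return "REJECTED: " + e.getMessage();
        } catch (RuntimeException e) {
            // Unexpected condition: hand over to the operations team.
            return "ESCALATED: " + e.getClass().getSimpleName();
        }
    }
}
```

Here the declared exceptions are answered automatically, while anything undeclared falls through to a separate path intended for operations staff.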
Techniques for recognizing errors
After determining which types of errors can be expected to occur (and defining how these will be represented), there are a variety of tools and techniques available to developers for identifying errors and classifying them within their application. An error handling strategy should define which of these methods to use and when to use them, based on the application requirements, design, and architecture.
Fault handlers in BPEL
When using BPEL to choreograph service invocations, we use fault handlers to define how a business process should behave when specified faults occur. Put another way, fault handlers are the tool business process developers use to define which exceptions the process should react to, and how.
Fault handlers may be configured to catch faults defined on the process component's reference interface(s), or one of the built-in faults. The Business Process engine detects when a declared or built-in fault is raised at runtime, and triggers the appropriate fault handler, if one exists. There is also an option to create a "Catch All" block, to be triggered if any unhandled error occurs, but it is recommended that developers use this with caution in a well-defined error handling approach. In general, it is a better practice to use a more specific fault handler. For example, if the goal is to catch all ServiceRuntimeExceptions, a fault handler for the built-in "runtimeFault" may be used instead.
Figure 1. User-defined and built-in faults
WebSphere Process Server's process engine defines a number of built-in faults. These faults should be understood and used consistently across the development effort.
Additionally, there are a number of samples and examples available for how to set up and define a fault handling routine: on the BPM samples and tutorials Web site, under Advanced BPEL features > Fault Handling.
Once the team is familiar with the capabilities of the product, then based on the error handling requirements and design, the team can decide when, where, and why to place fault handlers at specific scopes. This then becomes the standard, and the vast majority of the BPEL processes will behave consistently in adverse circumstances.
Failure nodes and unmodeled faults in WebSphere Enterprise Service Bus
The mediation tooling in WebSphere Enterprise Service Bus (hereafter known as WESB) provides features for handling both declared and undeclared faults.
If the interface of the target service (to which you are mediating) has a fault declared, a Callout Fault node will be automatically added to the response flow of your mediation. To define how your mediation behaves when this fault occurs, simply create a wire from the callout fault to the appropriate node in your flow.
To handle unmodeled or undeclared faults in either a request or response flow, create wires from the fail terminals of the node where the error may occur. Fail terminals are the rectangular terminals found on the right of each node. They are triggered whenever an unexpected error occurs during the execution of the node.
Figure 2. Callout Fault and Fail Terminals in a Response Flow
For an example of using fail terminals, please see the following article: Handling unmodeled faults within WebSphere Process Server V6.1
Whether you are creating Java components or are using Java snippets in BPEL or Interface maps, a standardized technique for handling errors should be defined and documented. Java development efforts should leverage the provided "try/catch" capability of the chosen implementation. Standardizing the error handling requirements for custom development projects will greatly increase the productivity of all subsequent problem determination exercises.
Java components use the same try and catch keywords provided by the Java runtime. WebSphere Process Server's programming model provides an implementation of declared and undeclared exceptions.
- ServiceRuntimeException: Undeclared and unexpected error types. Typically, these are created only by the runtime.
- ServiceBusinessException: Declared and expected error types. These are defined on the service interface and programmatically created.
Examples of exception handling for Java components can be found in the following article: Exception handling in WebSphere Process Server and WebSphere Enterprise Service Bus
Regardless of the "catch" capability or the component type used, fundamentally the recognition phase should contain at least the following basic principles for custom Java projects:
- Using the appropriate capability to capture the state of the system as close to the root cause as possible.
- Organizing the errors into groups such as expected or unexpected.
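The two principles above might look like the following in a custom Java component. This is only a sketch under stated assumptions: `ErrorCategory`, `ClassifiedException`, and `QuantityParser` are hypothetical names that are not part of any WebSphere API, and a real project would choose its own context fields.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical grouping of errors, as suggested by the second principle.
enum ErrorCategory { EXPECTED, UNEXPECTED }

// Hypothetical exception that carries a category plus a snapshot of relevant state.
class ClassifiedException extends RuntimeException {
    private final ErrorCategory category;
    private final Map<String, String> context;

    ClassifiedException(ErrorCategory category, String message,
                        Map<String, String> context, Throwable cause) {
        super(message, cause);
        this.category = category;
        this.context = Collections.unmodifiableMap(new LinkedHashMap<>(context));
    }

    ErrorCategory getCategory() { return category; }
    Map<String, String> getContext() { return context; }
}

class QuantityParser {
    // Catches the failure where it occurs and records the state that explains it.
    static int parseQuantity(String orderId, String rawQuantity) {
        try {
            return Integer.parseInt(rawQuantity);
        } catch (NumberFormatException e) {
            Map<String, String> context = new LinkedHashMap<>();
            context.put("orderId", orderId);
            context.put("rawQuantity", String.valueOf(rawQuantity));
            throw new ClassifiedException(
                    ErrorCategory.EXPECTED, "Quantity is not a number", context, e);
        }
    }
}
```

Because the original exception is kept as the cause and the context travels with it, later phases can record and react without having to reconstruct what the component was doing.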
After the appropriate recognition and classification of a problem or exceptions, it is critical to record data about the state of the system for future recovery. This data is essential to the problem determination effort that will follow in the React phase of the strategy.
Like the recognition phase, the record phase involves an important classification effort. Our error handling strategy should consider the differences between recording business and IT events. Although the two distinctions are similar, it would not be accurate to assume that all IT events are undeclared and all business events are declared. The mechanism that we use to record the incident will likely differ between the two.
IT events and logging
Generally, IT events are recorded to a log and/or emitted for IT monitoring products like Tivoli. Since debugging and other problem determination activities may require a precise and detailed record of what has happened, the volume of detail we choose to record is better suited to logs and collectors. The project should set up a common logging framework and standard. The logging framework can be implemented using configurable tools such as java.util.logging (JSR 47 logging).
The standard defined by your local team should define the frequency and depth of each log entry. There are many things to consider:
- Where are trace entries standard?
- Method entry and exits?
- How do we standardize the usage of message severity?
The strategy should define the standard for logging/trace and enforce the usage of the standard with peer or automated code reviews.
Figure 3. JSR47 Logging using Visual Snippets
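As a plain Java counterpart to the logging standard discussed above, the following sketch uses the java.util.logging API for method entry and exit tracing and for a severity-tagged message. The `AccountService` class and its `debit` method are hypothetical; only the `Logger` and `Level` calls come from the JDK.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical service class used only to illustrate a logging standard.
class AccountService {
    private static final String CLASS_NAME = AccountService.class.getName();
    private static final Logger LOGGER = Logger.getLogger(CLASS_NAME);

    static long debit(long balanceCents, long amountCents) {
        // Method entry trace (logged at Level.FINER by the JDK).
        LOGGER.entering(CLASS_NAME, "debit", new Object[] { balanceCents, amountCents });

        if (amountCents > balanceCents) {
            // A business-relevant condition, logged with an agreed severity.
            LOGGER.log(Level.WARNING, "Debit of {0} exceeds balance {1}",
                    new Object[] { amountCents, balanceCents });
            LOGGER.exiting(CLASS_NAME, "debit", balanceCents);
            return balanceCents;
        }

        long result = balanceCents - amountCents;
        // Method exit trace (also Level.FINER).
        LOGGER.exiting(CLASS_NAME, "debit", result);
        return result;
    }
}
```

Entry and exit records are only emitted when the logger and its handlers are configured for FINER or lower, which is one reason the standard should state which levels are enabled in each environment.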
Business events and CEI
Business events can be emitted in a WebSphere environment using the Common Event Infrastructure (CEI). Much like logging IT events, it is important to establish a standard for when, where, and why these business events are emitted. Emitting too much or too little will have different but adverse effects. For example, if the entire event digest is included in each event and a given event is logged a number of times, then the additional load on the system would be significant enough to warrant considering the scenario during load and capacity planning. If too few events are emitted, then the value of the data will suffer.
Finally, an error handling strategy needs to put the proper provisions in place to allow the IT and business organizations to react to the adverse condition. There are three ways a system can react to errors:
- Automated recovery
- Trigger manual corrective procedures (by a person such as an administrator)
- Switch to an alternate processing path
The timely and appropriate reaction to a reported condition will tremendously improve the resiliency of the deployed solution. This will in turn increase satisfaction with the provided business services as well as reduce the total cost of ownership of the solution.
Automated recovery within the "React Phase" of the error handling strategy implies that where possible, the application should automatically ensure that the system stays in a consistent and predictable state.
Transaction scopes can be configured and used to maintain the consistent state of resources during the course of an unexpected condition. XA Transactions may be used to create scopes around capable resources. A practical example of the configuration necessary to create and manage transaction scopes will be covered in subsequent articles.
Compensation of microflows and long-running processes can be used to "undo" the outcome of service invocations that have already completed. It is used when choreographing non-transactional services. (If all the services were transactional, you could have them participate in a single transaction.)
In long-running processes, compensation of activities that have successfully executed is initially triggered by a fault raised in the process, or can be explicitly triggered using a compensation activity. This is a useful technique for reversing the effects of already-committed transactions within a long-running process.
Figure 4. Compensation handlers and activities
Compensation in microflows is slightly different. If an error occurs during execution of the microflow, the microflow's transaction is marked for rollback; the entire microflow and all the transactional services it invokes will be rolled back. If the process interacts with non-transactional services, this can lead to an inconsistent state across the system. Instead, the business process developer should define compensation services for the activities that invoke non-transactional services. These will be executed during microflow compensation (when the microflow's transaction is rolled back).
Figure 5. Compensation in microflows
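The idea behind compensation can be illustrated outside of BPEL with a small Java sketch. This is a conceptual analogy only and not how the Business Process engine is implemented: `CompensationScope` and `BookingDemo` are hypothetical classes, and the booking steps stand in for non-transactional service invocations.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical helper that remembers how to undo each step that has completed.
class CompensationScope {
    private final Deque<Runnable> undoActions = new ArrayDeque<>();

    // Runs a step; only after it succeeds is its undo action registered.
    void run(Runnable step, Runnable undo) {
        step.run();
        undoActions.push(undo);
    }

    // Undoes the completed steps in reverse order of completion.
    void compensate() {
        while (!undoActions.isEmpty()) {
            undoActions.pop().run();
        }
    }
}

class BookingDemo {
    // Returns a journal of what happened, so the ordering is easy to inspect.
    static List<String> bookTrip(boolean carAvailable) {
        List<String> journal = new ArrayList<>();
        CompensationScope scope = new CompensationScope();
        try {
            scope.run(() -> journal.add("flight booked"), () -> journal.add("flight cancelled"));
            scope.run(() -> journal.add("hotel booked"), () -> journal.add("hotel cancelled"));
            scope.run(() -> {
                if (!carAvailable) {
                    throw new IllegalStateException("no cars");
                }
                journal.add("car booked");
            }, () -> journal.add("car cancelled"));
        } catch (RuntimeException fault) {
            // A fault triggers compensation of the steps that already completed.
            journal.add("fault: " + fault.getMessage());
            scope.compensate();
        }
        return journal;
    }
}
```

When the car booking fails, the journal shows the hotel and then the flight being cancelled, which mirrors the reverse-order undo that compensation handlers provide for completed activities.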
For more information and examples, please follow these links:
- Using compensation in business processes with Business Process Choreographer
- Advanced BPEL features > Compensation, on the BPM samples Web site
Additionally, there may be cases where automatic retries can be used to increase the resilience of the application to intermittent errors. WebSphere Process Server and WebSphere Enterprise Service Bus provide a retry capability automatically under some circumstances, such as when making asynchronous invocations. Identifying when and where retries will be provided "for free" will help the architecture team determine which implementation patterns and invocation styles should be used. The topic of retries will be addressed in more detail in subsequent articles.
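Where the product does not supply a retry, a custom Java component could apply a bounded retry of its own. The following is a minimal sketch of that idea, assuming a hypothetical `Retry` helper; it does not represent the retry behavior built into WebSphere Process Server or WebSphere Enterprise Service Bus.

```java
import java.util.concurrent.Callable;

// Hypothetical application-level retry helper with a fixed delay between attempts.
class Retry {
    static <T> T withRetries(Callable<T> call, int maxAttempts, long delayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;
                // Wait before the next attempt, but not after the final one.
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis);
                }
            }
        }
        // All attempts failed: surface the last failure to the caller.
        throw last;
    }
}
```

A bounded attempt count matters here: once the limit is reached, the failure should move on to the manual or alternate-path reactions rather than being retried indefinitely.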
There will always be unforeseen errors, or errors that the system cannot handle automatically. In these cases, the application developer may choose to have the application "hold" and wait, to allow an administrator (or other human) to manually diagnose and correct the error.
BPEL "Continue on Error" property
On BPEL processes and activities, a developer may set the "continue on error" property to "false". If the activity fails at runtime, the Business Process engine will transition the activity to the "stopped" state, and the process will not proceed to the next activity. The BPC explorer will show the process with the stopped activity under the Critical Processes view, and in some cases, the reason for failure. Stopped activities may be resubmitted or terminated by the process administrator.
Figure 6. Continue on Error, Process-level and Activity-level Setting
Failed events are created automatically by the WebSphere Process Server product. The architecture team should understand exactly when failed events are created and establish the appropriate development standards to leverage this product capability. These failed events will be stored in the system until they are dispositioned by an administrator or programmatically by the Failed Event Manager APIs. These APIs are documented in the web/ directory of your WPS test server installation:
For additional information regarding the Failed Event Manager's capabilities, please refer to the Failed Event Manager documentation in the appropriate information center for your specific product version.
In the case of manual intervention, administrators or the staff responsible for the manual activity need to be notified. The error handling strategy needs to document how the notifications will be sent, how frequently, and who (by role) is responsible for acting upon the event. As described previously, the notification mechanism will likely differ based on whether the error condition is an IT-level error or a business-related problem.
Switch to an alternate processing path
Sometimes, the business process definition accounts for error conditions. These are truly business-level errors, rather than IT level errors, and should be handled accordingly. Specifically, the logic for handling these error conditions is part of the modeled business process flow, rather than one of the techniques described in earlier sections. For example, in a human-centric workflow, all exception cases may be routed to a senior employee for processing. Or, in a business process defined using BPEL, you can take advantage of the "Case" and "Otherwise" clauses of the "Choice" construct to model If-Then-Else conditions.
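An alternate processing path of this kind is ordinary business logic. The sketch below expresses the same If-Then-Else idea in Java; the `ClaimRouter` class, the routing names, and the approval limit are hypothetical values chosen for illustration.

```java
// Hypothetical router showing error cases modeled as part of the business flow.
class ClaimRouter {
    // Assumed threshold, for illustration only.
    static final long AUTO_APPROVAL_LIMIT_CENTS = 100_000L;

    static String route(long claimCents, boolean documentsComplete) {
        if (!documentsComplete) {
            // Exception case: routed to a senior employee for processing.
            return "SENIOR_REVIEW";
        } else if (claimCents <= AUTO_APPROVAL_LIMIT_CENTS) {
            return "AUTO_APPROVE";
        } else {
            // Comparable to the "Otherwise" clause of a "Choice" construct.
            return "STANDARD_REVIEW";
        }
    }
}
```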
Evaluate and establish a system framework
These service level agreements will have a direct impact on the types of error handling frameworks that are required. After we have established the requirements for the framework, we can then begin to evaluate the provided product capabilities that we can leverage for error handling and recovery. You can use and combine each of the following product capabilities to establish a framework within the strategy:
- Business processes
- Human tasks
- Failed Event Manager
- Common Event Infrastructure
- Business Activity Monitoring
The capability of the frameworks will define the operational procedures needed in the event of system loss or instability. It is the developer's job to create applications that leverage the framework and the chosen product capability.
The scope of automated recovery is as broad as the imagination. There are many capabilities and tools that can be created with the product provided. After all, it is a process automation, integration, and BPM product. Published APIs and interfaces to the Failed Event Manager and the Business Process Container open up many possibilities for automating exception processing. The extent of the automated recovery should be discussed and defined before starting the project. As previously stated, the artifacts that are being developed will need to be created with a common theme.
Evaluate services and service providers
The selected framework could be affected by characteristics of the service such as:
- Batch windows and system availability
- Lack of transaction support
- Message tolerance/Assured Delivery: At least once, once and only once
- Reliability of the protocol
As you can see, these factors are usually directly related to the endpoints that are used in the solution and their capabilities. When developing the error handling strategy, you should evaluate each endpoint by a number of characteristics. Then the architecture team can develop common patterns and templates for dealing with the common sets of issues. Along the way, a standard (that is, always handle faults) is created. Adhering to the standard will increase the productivity of the development team and simplify process definitions as "all endpoints behave the same".
Fundamentally, each service interface can have any number of operations. However, one-way operations have different requirements than request response operations. Develop a standard for how to call and manage exceptions from the different types.
Web service imports provide different qualities of service than JMS imports. Therefore each import binding type used within the solution should have a standard interaction. The standard interaction should be developed to normalize the behaviors across the bindings to promote the "all endpoints behave the same" philosophy.
Invocation patterns and transactions
Messaging endpoints have a requirement for asynchronous invocation. The client has to create and commit a message before the Service Provider can provide an asynchronous response. This pattern of behavior should be standardized and made consistent. Additionally, all invocation pattern decisions should be made with a clear understanding of their benefits and impacts. A poorly designed application that uses an excessive amount of asynchronous interaction, whether intentionally or by mistake, can substantially increase the necessary amount of error handling logic and significantly reduce the throughput of the system.
Some endpoints are not always available. Whether the outage is planned or unplanned, the team should create a standard strategy for all endpoints to represent a contract for the conditions where the service is unavailable. The error handling strategy should anticipate this fact and provide a standard that promotes quick and easy problem determination and resolution. The standardization of this information also simplifies the creation of the project's operation manual.
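One way to give every endpoint the same "unavailable" contract is to translate transport-level failures into a single standard exception at the boundary. The following is a minimal sketch under stated assumptions: `EndpointUnavailableException` and `EndpointFaults` are hypothetical names, and the choice of which JDK exceptions count as "unavailable" is an example rather than a complete list.

```java
import java.net.ConnectException;
import java.net.SocketTimeoutException;

// Hypothetical standard exception representing "the endpoint is unavailable".
class EndpointUnavailableException extends RuntimeException {
    private final String endpoint;

    EndpointUnavailableException(String endpoint, Throwable cause) {
        super("Endpoint unavailable: " + endpoint, cause);
        this.endpoint = endpoint;
    }

    String getEndpoint() { return endpoint; }
}

class EndpointFaults {
    // Maps transport-level failures to the single, standard "unavailable" contract.
    static RuntimeException normalize(String endpoint, Exception e) {
        if (e instanceof ConnectException || e instanceof SocketTimeoutException) {
            return new EndpointUnavailableException(endpoint, e);
        }
        if (e instanceof RuntimeException) {
            // Already unchecked: pass through unchanged.
            return (RuntimeException) e;
        }
        return new IllegalStateException("Unclassified failure calling " + endpoint, e);
    }
}
```

With a single exception type for outages, the operations manual can describe one procedure for "endpoint unavailable" regardless of which binding or protocol reported it.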
There is a wide range of security options. This article series cannot address the vast security domain. However, it is a solid recommendation to approach security in as consistent a manner as possible. Additionally, security-related errors should be anticipated and addressed. Again, this will expedite problem resolution and standardize operational procedures.
Performance and throughput
Non-functional performance requirements can have a profound impact on the planning for error handling. Recovery requirements for batch style processing would be different than service providers that are used for synchronous queries. There are several questions that must be answered with respect to performance and throughput.
Bringing it all together
The total solution will be unique to the circumstances, requirements, and preferences of the local team. However, there are design patterns that you can use with every BPM project. This article was provided to start the conversation about error handling strategies and the concepts associated with them. In subsequent articles, we will define specific patterns of error handling that can be established as contributions to your local project's standards.
Developing an error handling strategy is a very important and often overlooked design activity. The strategy should precede the design and implementation of business requirements as the strategy will establish standards that the implementation must follow. Each error handling strategy is unique to the given project circumstances. However, you can borrow and customize common patterns to meet your specific needs and service level agreements.
"Exception handling in WebSphere Process Server and WebSphere Enterprise Service Bus"
(developerWorks, Feb 2008)
This article explains how error conditions are captured and processed in WebSphere Process Server and WebSphere Enterprise Service Bus, and describes problem detection, retry behavior, exception propagation, and reporting.
"Handling unmodeled faults within WebSphere Process Server V6.1"
(developerWorks, Feb 2008)
This article shows you how to facilitate fault handling within a BPEL process with a user-defined fault handler and SCA mediation module capabilities.
"Using compensation in business processes with Business Process Choreographer"
(developerWorks, Apr 2006)
This article explains the concepts and usage of compensation in business processes that are run with the Business Process Choreographer, a component of IBM WebSphere Process Server Version 6.0.
- "Business Process Management Samples & Tutorials"
- "JSR-47 Logging Specification"