This article is part of the Quality busters series, where I look at common influences on application quality from the enterprise view of the operational environment and non-functional requirements. Addressing these influences is a matter of making tradeoffs, with no single solution solving all problems.
The case of the missing messages
The SHEEP application, an optimized sales and marketing Web application, exchanges information with a legacy payroll application. One such information exchange involves the identity and commission rate for sales representatives. A very important exchange occurs when a new sales representative is hired and entered into the payroll system. This information exchange was implemented using XML-formatted messages and IBM® WebSphere® MQ.
The payroll application team recently installed a new version of the payroll application. For various reasons, the team members did not test the exchange of messages between the new version and the SHEEP application. The team thought that, because they made no change in the message format, such testing wasn't necessary.
A couple of weeks later, two new sales representatives were hired. But every time these individuals tried to access the SHEEP application, the attempt failed, and the application reported that they were "not a registered sales representative." Simply put, these new employees were not found in the SHEEP application's database. The sales manager once again called the SHEEP Web team to find out what happened and to get these new employees registered immediately.
After manually creating registration entries for the employees, the SHEEP team researched the problem. They found that the legacy team changed how commission rates were stored but did not update the message adapter correctly. As a result, the message arrived at the SHEEP application with a commission rate of 0.00. This was an invalid rate, and the SHEEP application rejected the message. Unfortunately, it simply wrote an entry in a log file about the message rejection -- and nobody reviewed that log file. As a result, the message rejection was ignored until the users complained.
What the SHEEP team members discovered was just one of the many failure points associated with message-oriented systems. In their case, they discovered the problem with message rejections -- how they are reported, where the report is sent, who acts on it, and how to recover from the problem.
In this article, I'm going to briefly discuss some of the failure points in message-oriented systems. These failure points are independent of technology (for example, WebSphere MQ, SOAP over HTTP, or UNIX message queues), message format (such as XML, tagged, or fixed length), or architecture. These failure points are common to all message-oriented interprocess communications mechanisms.
For the remainder of this article, I will be using generic terminology that is relatively independent of a specific technology, format, or architecture.
Failure points have different considerations depending upon the message type used. The most commonly used message types are synchronous and asynchronous.
Synchronous messages, sometimes called request/reply messages, have a number of distinguishing characteristics. They are:
- Time dependent. The message must be processed and a response received within a specific time interval, called the timeout period. If the message is not processed in this time, then the message can be ignored or expired.
- Conversational. The messaging consists of a request message and a matching reply message. The process that sent the request message consumes the reply message. Usually, the requesting process waits until the reply is received or a timeout occurs.
- Small. The messages are often small in size.
- Transactional. The messages may be part of a larger transaction orchestrated by the requesting process.
- Not persistent. Because these messages expire, it is not necessary to guarantee their delivery or to persist them in the message middleware.
Asynchronous messages, sometimes called datagram messages, have different characteristics. They are:
- Time independent. The message must be processed eventually, but no particular time interval is specified.
- Directional. The message consists of a single datagram message going from one process to another.
- Large. The messages are often large in size.
- Non-transactional. The message may initiate a transaction in the receiving process, but it is not part of an extended transaction in the sending process.
- Persistent. Because these messages do not expire, it is necessary to guarantee delivery and persist the message in the middleware.
The first common type of message failure is a path failure. These are failures that occur along a message's journey from the sending process to the receiving process. A generalized message path is illustrated in Figure 1.
Figure 1. Message path failure points
I'll describe path failures in more detail in the following subsections.
A send failure occurs when the sending process is unable to deliver a message to the message delivery mechanism. A send failure might occur if the message delivery mechanism is not available or running, the sending process does not have adequate authority or access rights, or the sending process is incorrectly configured (among other reasons).
If the message is asynchronous, then the sending process must somehow retain the message and keep trying the connection to the message delivery mechanism until it is successfully delivered. This may require adding a journal or other persistent store for unsent messages. The sending process also needs to ensure that the message is persisted from the time of message generation until it is successfully handed off to the delivery mechanism. The pseudo-code in Listing 1 illustrates an approach to handling asynchronous messages in the sending process.
Listing 1. Asynchronous message send processing
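void sendMsg( eventData msgEvent ) throws RetryError {
    // Persist the event before the first send attempt so it survives a crash.
    eventStore.add(msgEvent)
    while (true) {
        msgData = formatIntoMsgData(msgEvent)
        Queue.putMsg(msgData)
        if (Queue.msgSentWithoutError()) break
        Queue.logError()
        // On giving up, the event stays in eventStore for later recovery.
        if (retryCountExceeded()) throw RetryError
    }
    // Remove the event only after the delivery mechanism has accepted it.
    eventStore.remove(msgEvent)
}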
For synchronous messages, you have more flexibility. Synchronous messages have an expiry time; thus, if the message cannot be sent, the sending process times out and invokes the specified timeout processing. The pseudo-code in Listing 2 illustrates an approach to handling synchronous messages in the sending process.
Listing 2. Synchronous message send processing
ReplyData sendMsgReceiveReply( eventData msgEvent )
        throws TimeoutError, DeliveryError {
    msgData = formatIntoMsgData(msgEvent)
    OutboundQueue.putMsg(msgData)
    if (OutboundQueue.error()) throw DeliveryError
    // Wait for the matching reply, but only for the timeout period.
    replyMsg = ReplyQueue.getMsgTimeout()
    if (ReplyQueue.timeout()) throw TimeoutError
    if (ReplyQueue.error()) throw DeliveryError
    replyData = parseMsgIntoData(replyMsg)
    return replyData
}
A transmit failure occurs when the delivery mechanism is unable to transmit a message from the sending process to the receiving process. Some technologies, such as WebSphere MQ, have built-in support for reliable message delivery and persistence. Other technologies, such as Web services, do not have native support for reliable delivery and persistence, requiring developers to build such capabilities.
If your system is based on the store-and-forward routing of messages through multiple components (if there is a message broker or integration hub, for example), then the number of times the message is handled during transit increases, with a corresponding increase in likely failure points.
When an architect is building a message-oriented solution, he or she must understand how the selected delivery mechanism can fail. Can it guarantee message delivery -- for example, by using two-phase commit? Does it support persistence? Can it be used to implement distributed transactions? And, when it does fail, how can the sending or receiving process detect that failure?
In some cases, the delivery mechanism might not find the destination and will forward the message to a dead letter queue. If this happens, it is important to have a process that monitors that queue.
A receive failure occurs when the receiving process is unable to retrieve and handle a message available in the delivery mechanism. A receive failure can occur if, for instance, the message delivery mechanism is not available or running, the receiving process does not have adequate authority or access rights, or the receiving process is incorrectly configured.
If the message is asynchronous, then the receiving process must retain the message until the message has been successfully processed. This means that the receiving process must keep the message in some persistence store (perhaps keeping it on the inbound queue) until the processing is complete. The goal is to eliminate the window of vulnerability between reading the message from the delivery mechanism and the completion of processing. If a failure occurs during this window of vulnerability, then the message is lost. The ideal solution is to use a delivery mechanism that has persistent queues so that the message stays on the queue until explicitly deleted. The pseudo-code in Listing 3 illustrates this approach in the receiving process.
Listing 3. Asynchronous message receive processing
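void receiveMsg( MsgProcess theProcess )
        throws DeliveryError, ProcessingError {
    // Browse and lock rather than remove, so the message stays on the queue.
    receivedMsg = InboundQueue.browseAndLockMsg()
    if (InboundQueue.error()) throw DeliveryError
    receivedData = parseMsgIntoData(receivedMsg)
    // If processing throws, the message has not been deleted from the queue.
    theProcess.process(receivedData)
    // Delete the message only after processing has completed.
    InboundQueue.unlockAndDeleteMsg(receivedMsg)
}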
For synchronous messages, you have more flexibility. Since the sending process will time out and go into error handling if a reply does not arrive, the receiving process does not have to ensure that the message is maintained until successfully processed.
A format failure occurs when a message generation event is unable to build a message in the agreed upon message format. Formatting failures can be a result of missing information or corrupted data.
In many mission-critical applications, it is vital that all message generation events result in a message properly delivered to the destination process. As a result, it is important to handle format errors. One solution might be to report any formatting failure that occurs and then substitute default values for unreadable values from the original message.
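The report-and-substitute idea might look like the following Python sketch. The field names, the default values, and the `format_sales_rep_message` helper are hypothetical, chosen only for illustration. One assumption is worth stressing: a substituted default must itself be a value the receiver accepts, which the 0.00 commission rate in the opening story was not.

```python
import logging

logger = logging.getLogger("message.format")

# Hypothetical defaults; each must be a value the receiver will accept.
DEFAULTS = {"commission_rate": "0.05", "region": "UNASSIGNED"}
REQUIRED = ("rep_id", "commission_rate", "region")


def format_sales_rep_message(event):
    """Build the outbound field map, reporting and defaulting bad fields."""
    fields, problems = {}, []
    for name in REQUIRED:
        value = event.get(name)
        if value in (None, ""):
            if name not in DEFAULTS:
                # No safe default exists, so no message can be built.
                raise ValueError(f"cannot format message: missing {name}")
            problems.append(name)
            value = DEFAULTS[name]
        fields[name] = str(value)
    if problems:
        # Report the substitution so someone can repair the source data.
        logger.warning("defaulted fields %s for event %r", problems, event)
    return fields, problems
```

Returning the list of defaulted fields alongside the message lets the caller decide whether a warning in a log is enough or whether the event also needs to be flagged for repair.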
A parse failure occurs when a receiving process is unable to parse and translate a received message. Parsing failures can be caused by incorrectly formatted messages, missing required data, translation or transformation errors, and other related problems.
An architect must consider how to handle parse errors. Can improper values be replaced with default values, thus allowing the message to be processed? Or must the message be rejected? If it is rejected, is an error message sent to the sending process? Or is an entry written in a message error log to be handled later by an operator?
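As one possible answer to these questions, the Python sketch below rejects any message it cannot parse or validate and raises an exception that carries the reason, leaving the caller to decide where the rejection is reported. The element names `RepId` and `CommissionRate` and the rule that a rate must be positive are assumptions made for this example, not part of any real message contract.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal, InvalidOperation


class MessageRejected(Exception):
    """Raised when a message cannot be parsed into valid data."""


def parse_register_sales_rep(xml_text):
    """Parse a hypothetical RegisterSalesRep message or reject it."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        raise MessageRejected(f"malformed XML: {exc}") from exc
    rep_id = root.findtext("RepId")
    rate_text = root.findtext("CommissionRate")
    if not rep_id:
        raise MessageRejected("missing RepId")
    try:
        rate = Decimal(rate_text)
    except (InvalidOperation, TypeError):
        # TypeError covers a missing element (rate_text is None).
        raise MessageRejected(f"unreadable CommissionRate: {rate_text!r}")
    if rate <= 0:
        raise MessageRejected(f"invalid CommissionRate: {rate}")
    return {"rep_id": rep_id, "commission_rate": rate}
```

Because the rejection is an exception rather than a silent log entry, the calling code is forced to choose a reporting path for it.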
The second category of message failure is a contract failure. These are failures that occur as a result of change in the message contract -- that is, in the format and meaning of the fields in the message. These conflicts result from the failure to coordinate message contract changes with both the sending and receiving processes.
The common approach to handling contract failures is to coordinate the simultaneous update of both the sending and receiving processes whenever the message contract changes. However, this defeats the goal of process independence, because the processes are tightly coupled by the shared messages.
One way to address this problem is to add versioning information to messages. Now, each time the message contract changes -- if a field is added or removed or redefined, for instance -- the version identification of the message changes.
The architect must decide what form this versioning information will take. It could be a version number in the header tag. It could be a standalone tag. It could be a namespace. It could be a fixed field at the beginning of fixed-length messages. It could be something else; but there needs to be something.
For asynchronous messages, the message would contain the message's format version. Listing 4 shows a sample XML message with a version number in the header tag.
Listing 4. Asynchronous message versioning example
<RegisterSalesRep version="1">
<!-- . . . rest of message . . . -->
</RegisterSalesRep>
For synchronous messages, the request message would contain both the message's contract version and the desired contract version for the reply message. This is critical, because it is very likely that the requesting process will not support the most recent reply message contract, but rather only a previous version. Listing 5 shows a sample XML message with current and reply version information.
Listing 5. Synchronous request message versioning example
<GetSalesRepDetails requestVersion="3" replyVersion="2">
<!-- . . . rest of message . . . -->
</GetSalesRepDetails>
Handling message versions, however, adds complexity to the application. The server application will have to maintain several back-level version parsing and formatting components. Also, whenever the interface to the back-end component changes -- if new data elements are added, for instance -- then every back-level version parsing and formatting component requires an update as well. Figure 2 shows the increase in application components necessary to support message versioning.
Figure 2. Versioning complexity example
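To make the cost of versioning concrete, here is a minimal Python sketch of a receiver that selects a parser from the `version` attribute on the root element. The two message layouts (`Rate` as a fraction in version 1, `RatePercent` as a percentage in version 2) are invented for illustration; the point is that every supported version needs its own parsing component, and each must map onto the same internal representation.

```python
import xml.etree.ElementTree as ET


def parse_v1(root):
    # Hypothetical version 1 layout: rate stored as a fraction.
    return {"rep_id": root.findtext("RepId"),
            "rate": float(root.findtext("Rate"))}


def parse_v2(root):
    # Hypothetical version 2 layout: rate stored as a percentage.
    return {"rep_id": root.findtext("RepId"),
            "rate": float(root.findtext("RatePercent")) / 100.0}


# One entry per supported contract version.
PARSERS = {"1": parse_v1, "2": parse_v2}


def parse_versioned(xml_text):
    """Dispatch to the parser that matches the message's version."""
    root = ET.fromstring(xml_text)
    version = root.get("version")
    parser = PARSERS.get(version)
    if parser is None:
        raise ValueError(f"unsupported message version: {version!r}")
    return parser(root)
```

An unknown or missing version is treated as an error here rather than being guessed at, so a contract change that was never coordinated surfaces immediately instead of as bad data.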
While it's easy to understand the potential failure points in a messaging-based application, dealing with those failure points is not so simple. For example:
- It may be acceptable to lose some messages; for other, more critical messages, this may not be an option.
- Some messages may be time-independent; others may be time-dependent.
- Some messages may allow for instances to arrive in any order; others may depend upon proper sequencing.
- Some messages may be generated by a background or batch process; others may be generated by an interactive, user-initiated process.
- Some messages may tolerate duplicates; others may not permit them.
It is important for an architect to have a toolkit of failure-handling techniques for message-oriented systems. In the following subsections, I offer a survey of some approaches.
To avoid processing a message twice, a form of sequence numbering or message identification is needed. The sending process assigns each message a unique identifier at message generation time. If the message has to be retransmitted as a result of connection error recovery, then the message must retain its original message identifier. The receiving process must keep track of the message identifiers of the messages it has processed. In this way, if the receiving process receives a message with a message identifier matching that of a message it has already processed, then the receiving process can reject the message as being a duplicate.
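A minimal Python sketch of this duplicate check follows. The message is modeled as a plain dictionary, and the set of processed identifiers is held in memory only; both are simplifying assumptions, and a real receiver would keep the identifiers in a persistent store so that the check survives a restart.

```python
import uuid


def new_message(payload):
    """Sender side: assign the identifier once, at generation time."""
    return {"id": str(uuid.uuid4()), "payload": payload}


class DeduplicatingReceiver:
    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # assumption: in-memory; use a durable store in practice

    def receive(self, message):
        """Return True if processed, False if rejected as a duplicate."""
        if message["id"] in self._seen:
            return False
        self._handler(message["payload"])
        # Record the identifier only after processing has succeeded.
        self._seen.add(message["id"])
        return True
```

A retransmission reuses the same dictionary, and therefore the same identifier, so the second delivery is rejected while a genuinely new message with the same payload is not.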
To ensure that messages are processed in a particular sequence, a form of message sequence identification must be added to the message. The receiving process can then hold messages until the desired sequence of messages has been completely received. At that point, it will process these held messages in the proper sequential order.
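The holding behavior can be sketched in Python as follows. This is a variant under stated assumptions: sequence numbers are consecutive integers starting at a known value, and a held message is released as soon as every earlier message has arrived, rather than waiting for a whole batch to complete. The `Resequencer` class and its method names are hypothetical.

```python
class Resequencer:
    """Hold out-of-order messages and release them in sequence order."""

    def __init__(self, handler, first_seq=1):
        self._handler = handler
        self._next = first_seq  # next sequence number to release
        self._held = {}         # messages that arrived early

    def receive(self, seq, payload):
        if seq < self._next:
            return  # already released; ignore the late copy
        self._held[seq] = payload
        # Release every message that is now contiguous with the last one.
        while self._next in self._held:
            self._handler(self._held.pop(self._next))
            self._next += 1

    def held_count(self):
        return len(self._held)
```

A sketch like this holds messages indefinitely if one sequence number never arrives, so a real implementation would also need a policy for gaps, such as a timeout that alerts an operator.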
For asynchronous messages, the sending process should implement a journal or other persistence mechanism to save messages, and then keep trying to send them until it receives an acknowledgement. If, after some number of attempts, the delivery mechanism still does not respond, then the process should notify the operator for manual intervention.
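As a runnable counterpart to this description, the Python sketch below journals the message, retries a bounded number of times, and notifies an operator if every attempt fails. The `put` and `notify_operator` callables and the use of `ConnectionError` to signal an unavailable delivery mechanism are assumptions for the example; the journal is a plain list standing in for a durable store, and the delay between attempts that a real sender would want is omitted.

```python
def send_with_retry(message, put, journal, notify_operator, max_attempts=3):
    """Return True once the message is handed off, False after giving up."""
    journal.append(message)  # persist before the first attempt
    for _attempt in range(max_attempts):
        try:
            put(message)
        except ConnectionError:
            continue  # delivery mechanism unavailable; try again
        # Handed off successfully, so the journal entry is no longer needed.
        journal.remove(message)
        return True
    # Every attempt failed: leave the message journaled and ask for help.
    notify_operator(message)
    return False
```

On failure the message deliberately stays in the journal, so that it can be resent once the operator has restored the delivery mechanism.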
For synchronous messages, the sending process must implement a timeout mechanism. If a reply is not received within this timeout period, then the sending process reports an error. To assist, if the delivery mechanism supports message expiry, then the message expiry should be set to this timeout period.
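The following Python sketch shows the timeout and the matching expiry together, using the standard library's `queue.Queue` as a stand-in for the delivery mechanism. The `expires_at` field and the `is_expired` helper are hypothetical, and the expiry is stamped with a process-local monotonic clock, which only makes sense within one process; real middleware would carry its own expiry setting.

```python
import queue
import time


def send_and_wait(outbound, replies, request, timeout_seconds):
    """Send a request and wait for the reply, or raise TimeoutError."""
    # Stamp the request with an expiry matching the caller's timeout so a
    # receiver can discard it once nobody is waiting for the reply.
    stamped = dict(request, expires_at=time.monotonic() + timeout_seconds)
    outbound.put(stamped)
    try:
        return replies.get(timeout=timeout_seconds)
    except queue.Empty:
        raise TimeoutError(f"no reply within {timeout_seconds}s") from None


def is_expired(request):
    """Receiver side: True if the requester has already stopped waiting."""
    return time.monotonic() >= request["expires_at"]
```

Matching the two values means the receiver never spends effort on a request whose reply would arrive after the requester has already reported an error.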
Where possible, save messages in a persistence store until the message is successfully processed. Some delivery mechanism products, such as WebSphere MQ, implement a persistence store, allowing messages to remain on a queue until expressly deleted.
The receiving process might log each message received. A database table or other persistence mechanism can store the messages. This logging serves several purposes. As an audit trail, it verifies that the receiving process received the message. If you build an operator console, an operator can use this log to view, repair, and retransmit messages rejected by the receiving process.
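One way such a message log could be laid out is sketched below in Python with the standard `sqlite3` module. The table name, the columns, and the three status values are assumptions made for this example; an in-memory database is used so that the sketch is self-contained.

```python
import sqlite3


def open_journal(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS message_journal ("
        " id INTEGER PRIMARY KEY,"
        " body TEXT NOT NULL,"
        " status TEXT NOT NULL,"  # RECEIVED, PROCESSED, or REJECTED
        " reason TEXT)"
    )
    return db


def record_received(db, body):
    """Log the message on arrival and return its journal id."""
    cur = db.execute(
        "INSERT INTO message_journal (body, status) VALUES (?, 'RECEIVED')",
        (body,),
    )
    db.commit()
    return cur.lastrowid


def record_outcome(db, message_id, status, reason=None):
    db.execute(
        "UPDATE message_journal SET status = ?, reason = ? WHERE id = ?",
        (status, reason, message_id),
    )
    db.commit()


def rejected_messages(db):
    """What an operator console would list for repair and retransmission."""
    return db.execute(
        "SELECT id, body, reason FROM message_journal"
        " WHERE status = 'REJECTED' ORDER BY id"
    ).fetchall()
```

Writing the row on arrival and updating it afterwards gives the audit trail, while the query over rejected rows is the starting point for a console.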
If the delivery mechanism provides a dead letter queue, then it is vital that a monitoring process exist. But what should the process do with any messages that are routed to the dead letter queue? The process might forward the message to another queue for manual intervention, forward the message to the proper destination queue, or just ignore the messages.
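A monitoring process that combines the first two options might look like this Python sketch. Queues are modeled as in-memory `deque` and list objects, and `known_destinations` is a hypothetical lookup table that maps destination names, including known misspellings, to the queue that should have received the message. Anything the table cannot resolve is parked for manual intervention rather than ignored.

```python
from collections import deque


def drain_dead_letters(dead_letters, known_destinations, manual_queue):
    """Re-deliver each dead letter if possible, else park it for a human.

    Returns a (rerouted, parked) pair of counts.
    """
    rerouted = parked = 0
    while dead_letters:
        message = dead_letters.popleft()
        target = known_destinations.get(message.get("destination"))
        if target is not None:
            target.append(message)
            rerouted += 1
        else:
            manual_queue.append(message)
            parked += 1
    return rerouted, parked
```

The returned counts give the monitor something to report, so that a sudden rise in parked messages is visible to operations.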
In the case of synchronous messages, errors are sent as a response message to the requesting process. This allows the requestor to immediately deal with the error situation. But in the case of asynchronous messages, which are more batch-like in behavior, there is no process to receive error reports. The architect must decide how to report errors and message rejections for asynchronous messages. Here are some suggestions:
- Put the information into the error log. Of course, a log can generally only be accessed by a programmer, and the programmer may not understand the business importance of the message.
- Put the information into a special message journal database table. This allows for the creation of an operator console where rejected messages can be viewed, edited, and retransmitted. But this does add the overhead of a quickly growing journal table and requires the investment of time for building a special operator console.
- Send an alert message to an operator monitoring utility. While this allows the operations team to learn about messages being rejected, it does not assist with the recovery of those messages.
- Send an error report back to the sending process, thus deferring error handling to the sending process. This requires that the original message contain information about the sending process and the sending process's error handling queue so the reporting mechanism knows where to send the error report.
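The last suggestion depends on the original message carrying enough addressing information. A minimal Python sketch of building such an error report follows; the `error_queue`, `sender`, `id`, and `body` fields are hypothetical header names chosen for the example, not fields of any particular messaging product.

```python
def build_error_report(original, reason):
    """Build a rejection report addressed to the sender's error queue."""
    reply_to = original.get("error_queue")
    if reply_to is None:
        # Without a return address, the report has nowhere to go.
        raise ValueError("original message names no error queue")
    return {
        "destination": reply_to,
        "in_reply_to": original["id"],
        "sender": original.get("sender"),
        "reason": reason,
        # Include the original body so the sender can repair and resend it.
        "original_body": original["body"],
    }
```

If the original names no error queue, the receiver has to fall back on one of the other options in the list, which is why the sketch raises an error rather than dropping the report.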
Just as difficult as determining how and where to send error reports is determining who is responsible for monitoring and repairing message errors. Ideally, the end users or business owners are the ones who correct the message errors. Enabling this requires developing additional programs that provide user-friendly assistance, maintenance, and monitoring. Unfortunately, the application project plan rarely considers or funds these additional programs.
In reality, it seems application developers are the ones dealing with the message errors. Since the developers have the tools and knowledge of the message environment, they are able to diagnose the situation. However, the developers need the context information from the end-user or business owner to repair the situation. As a result, placing the responsibility for message monitoring and repair with developers or system operations is inefficient because they have to spend time diagnosing the problem, tracking down the end-user or business owner, and making the repair. This repair time is a distraction from the developer's regular job responsibilities.
When creating a distributed application, it is important to keep the failure points in mind. The following are some questions the architect should ask:
- What are the failure points along the path of message delivery from process A to process B?
- What facilities are available in the selected delivery mechanism for handling failures or assisting with failure recovery?
- What is the type of each message? Is it synchronous or asynchronous?
- If a send failure occurs, how is that reported within the sending process?
- If a transmit failure occurs, how is that reported, and to what agent?
- If a receive failure occurs, how is that reported, and to what agent?
- Is there a way to edit and resend rejected messages?
- Who is responsible for monitoring and handling a rejected or erroneous message?
- Should messages be recorded in a persistence store for auditing or recovery handling?
- If the application includes a dead letter queue, who monitors it? How is the queue monitored?
- Are messages versioned? Are message adapters available for each supported version?
This article only scratches the surface of issues related to message-oriented application failure points. The application architect must consider all the failure points that might interfere with data moving from one process to another. A single solution does not exist for all messages. Asynchronous messages are handled differently from synchronous messages. The chosen technology for message delivery also influences which approaches are available. By having an adequate toolkit of approaches, an architect can pick the best approach for each situation and message.

Michael Russell has a bachelor's degree in physics and a master's degree in computer science. He was a logistics engineer, a technical services manager, and a certified IT architect at IBM for nearly 14 years. He is currently a Web application architect for a resort company in Orlando. He has experience in Windows, UNIX, and OS/400 environments. He uses Web technology for entertainment through his own company, Vicki Fox Productions (http://www.VickiFox.com).




