Quality busters: Losing messages

Addressing error scenarios in message-oriented systems

Michael Russell (MikeRussell@VickiFox.com), Application Architect, Vicki Fox Productions, Inc.
Michael Russell has a bachelor's degree in physics and a master's degree in computer science. He was a logistics engineer, a technical services manager, and a certified IT architect at IBM for nearly 14 years. He is currently a Web application architect for a resort company in Orlando. He has experience in Windows, UNIX, and OS/400 environments. He uses Web technology for entertainment through his own company, Vicki Fox Productions (http://www.VickiFox.com).

Summary:  The success of a message-oriented system, regardless of the technology used to implement it, depends upon the consistent and reliable delivery of messages between processes. In this installment of Quality busters, Michael Russell identifies some of the places or failure points along the path between processes where a message can be lost or rejected. If you don't properly address these failure points, the results might include data corruption, out-of-sync conditions, timeouts, and perceived unreliability.


Date:  28 Dec 2004
Level:  Introductory


This article is part of the Quality busters series, where I look at common influences on application quality from the enterprise view of the operational environment and non-functional requirements. Addressing these influences is a matter of making tradeoffs, with no single solution solving all problems.

The case of the missing messages

The SHEEP application, an optimized sales and marketing Web application, exchanges information with a legacy payroll application. One such information exchange involves the identity and commission rate for sales representatives. A very important exchange occurs when a new sales representative is hired and entered into the payroll system. This information exchange was implemented using XML-formatted messages and IBM® WebSphere® MQ.

The payroll application team recently installed a new version of the payroll application. For various reasons, the team members did not test the exchange of messages between the new version and the SHEEP application. The team thought that, because they made no change in the message format, such testing wasn't necessary.

A couple of weeks later, two new sales representatives were hired. But every time these individuals tried to access the SHEEP application, the attempt failed, and the application reported that they were "not a registered sales representative." Simply put, these new employees were not found in the SHEEP application's database. The sales manager once again called the SHEEP Web team to find out what happened and to get these new employees registered immediately.

After manually creating registration entries for the employees, the SHEEP team researched the problem. They found that the legacy team changed how commission rates were stored but did not update the message adapter correctly. As a result, the message arrived at the SHEEP application with a commission rate of 0.00. This was an invalid rate, and the SHEEP application rejected the message. Unfortunately, it simply wrote an entry in a log file about the message rejection -- and nobody reviewed that log file. As a result, the message rejection was ignored until the users complained.


So many failure points

What the SHEEP team members discovered was just one of the many failure points associated with message-oriented systems. In their case, they discovered the problem with message rejections -- how they are reported, where the report is sent, who acts on it, and how to recover from the problem.

In this article, I'm going to briefly discuss some of the failure points in message-oriented systems. These failure points are independent of technology (for example, WebSphere MQ, SOAP over HTTP, or UNIX message queues), message format (such as XML, tagged, or fixed length), or architecture. These failure points are common to all message-oriented interprocess communications mechanisms.


Message types

For the remainder of this article, I will be using generic terminology that is relatively independent of a specific technology, format, or architecture.

Failure points have different considerations depending upon the message type used. The most commonly used message types are synchronous and asynchronous.

Synchronous messages, sometimes called request/reply messages, have a number of distinguishing characteristics. They are:

  • Time dependent. The message must be processed and a response received within a specific time interval, called the timeout period. If the message is not processed in this time, then the message can be ignored or expired.

  • Conversational. The messaging consists of a request message and a matching reply message. The process that sent the request message consumes the reply message. Usually, the requesting process waits until the reply is received or a timeout occurs.

  • Small. The messages are often small in size.

  • Transactional. The messages may be part of a larger transaction orchestrated by the requesting process.

  • Not persistent. Because these messages expire, it is not necessary to guarantee their delivery or to persist them in the message middleware.

Asynchronous messages, sometimes called datagram messages, have different characteristics. They are:

  • Time independent. The message must be processed eventually, but no particular time interval is specified.

  • Directional. The message consists of a single datagram message going from one process to another.

  • Large. The messages are often large in size.

  • Non-transactional. The message may initiate a transaction in the receiving process, but it is not part of an extended transaction in the sending process.

  • Persistent. Because these messages do not expire, it is necessary to guarantee delivery and persist the message in the middleware.

Path failures

The first common type of message failure is a path failure. These are failures that occur along a message's journey from the sending process to the receiving process. A generalized message path is illustrated in Figure 1.


Figure 1. Message path failure points

I'll describe path failures in more detail in the following subsections.

Send failures

A send failure occurs when the sending process is unable to deliver a message to the message delivery mechanism. A send failure might occur if the message delivery mechanism is not available or running, the sending process does not have adequate authority or access rights, or the sending process is incorrectly configured (among other reasons).

If the message is asynchronous, then the sending process must somehow retain the message and keep retrying the connection to the message delivery mechanism until the message is successfully delivered. This may require adding a journal or other persistent store for unsent messages. The sending process also needs to ensure that the message is persisted from the time of message generation until it is successfully handed off to the delivery mechanism. The pseudo-code in Listing 1 illustrates an approach to handling asynchronous messages in the sending process.


Listing 1. Asynchronous message send processing
 

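    void sendMsg( eventData msgEvent ) throws RetryError {
        // Persist the event first so that it survives a failure before hand-off
        eventStore.add(msgEvent)
        // Format once; the same message data is reused on every retry
        msgData = formatIntoMsgData(msgEvent)
        while(true) {
            Queue.putMsg(msgData)
            if (Queue.msgSentWithoutError()) break
            Queue.logError()
            // On giving up, the event stays in eventStore for later recovery
            if (retryCountExceeded()) throw RetryError
        }
        // Remove the event only after a confirmed hand-off
        eventStore.remove(msgEvent)
    }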

For synchronous messages, you have more flexibility. Synchronous messages have an expiry time; thus, if the message cannot be sent, the sending process will time out and invoke the specified timeout processing. The pseudo-code in Listing 2 illustrates an approach to handling synchronous messages in the sending process.


Listing 2. Synchronous message send processing
 

    ReplyData sendMsgReceiveReply( eventData msgEvent )
    throws TimeoutError, DeliveryError {
        msgData = formatIntoMsgData(msgEvent)
        OutboundQueue.putMsg(msgData)
        if (OutboundQueue.error()) throw DeliveryError
        // Wait for the matching reply, but only for the timeout period
        replyMsg = ReplyQueue.getMsgTimeout()
        if (ReplyQueue.timeout()) throw TimeoutError
        if (ReplyQueue.error()) throw DeliveryError
        replyData = parseMsgIntoData(replyMsg)
        return replyData
    }

Transmit failures

A transmit failure occurs when the delivery mechanism is unable to transmit a message from the sending process to the receiving process. Some technologies, such as WebSphere MQ, have built-in support for reliable message delivery and persistence. Other technologies, such as Web services, do not have native support for reliable delivery and persistence, requiring developers to build such capabilities.

If your system is based on the store-and-forward routing of messages through multiple components (a message broker or integration hub, for example), then the number of times the message is handled during transit increases, with a corresponding increase in likely failure points.

When an architect is building a message-oriented solution, he or she must understand how the selected delivery mechanism can fail. Can it guarantee message delivery -- for example, by using two-phase commit? Does it support persistence? Can it be used to implement distributed transactions? And, when it does fail, how can the sending or receiving process detect that failure?

In some cases, the delivery mechanism might not find the destination and will forward the message to a dead letter queue. If this happens, it is important to have a process that monitors that queue.

Receive failures

A receive failure occurs when the receiving process is unable to retrieve and handle a message available in the delivery mechanism. A receive failure can occur if, for instance, the message delivery mechanism is not available or running, the receiving process does not have adequate authority or access rights, or the receiving process is incorrectly configured.

If the message is asynchronous, then the receiving process must retain the message until the message has been successfully processed. This means that the receiving process must keep the message in some persistence store (perhaps keeping it on the inbound queue) until the processing is complete. The goal is to eliminate the window of vulnerability between reading the message from the delivery mechanism and the completion of processing. If a failure occurs during this window of vulnerability, then the message is lost. The ideal solution is to use a delivery mechanism that has persistent queues so that the message stays on the queue until explicitly deleted. The pseudo-code in Listing 3 illustrates this approach in the receiving process.


Listing 3. Asynchronous message receive processing
 

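    void receiveMsg( MsgProcess theProcess )
    throws DeliveryError, ProcessingError {
        // Read the message without removing it from the queue
        receivedMsg = InboundQueue.browseAndLockMsg()
        if (InboundQueue.error()) throw DeliveryError
        receivedData = parseMsgIntoData(receivedMsg)
        // If processing fails here, the message is still on the queue
        theProcess.process(receivedData)
        // Delete the message only after processing has completed
        InboundQueue.unlockAndDeleteMsg(receivedMsg)
    }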

For synchronous messages, you have more flexibility. Since the sending process will time out and go into error handling if a reply does not arrive, the receiving process does not have to ensure that the message is maintained until successfully processed.

Format failures

A format failure occurs when a message generation event is unable to build a message in the agreed-upon message format. Formatting failures can be a result of missing information or corrupted data.

In many mission-critical applications, it is vital that all message generation events result in a message properly delivered to the destination process. As a result, it is important to handle format errors. One solution might be to report any formatting failure that occurs and then substitute default values for unreadable values from the original message.

Parse failures

A parse failure occurs when a receiving process is unable to parse and translate a received message. Parsing failures can be caused by incorrectly formatted messages, missing required data, translation or transformation errors, and other related problems.

An architect must consider how to handle parse errors. Can improper values be replaced with default values, thus allowing the message to be processed? Or must the message be rejected? If it is rejected, is an error message sent to the sending process? Or is an entry written in a message error log to be handled later by an operator?
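One way to make these choices explicit is to have the parser return either the parsed data or a rejection reason, so that the caller can decide where the rejection is reported. The following Python sketch applies that idea to the registration message from the SHEEP scenario. The element names `EmployeeId` and `CommissionRate`, the `ParseResult` type, and the rule that a rate must be greater than zero are assumptions made for illustration; the article does not show the real message layout.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParseResult:
    accepted: bool
    data: Optional[dict] = None
    reason: Optional[str] = None


def parse_registration(xml_text: str) -> ParseResult:
    """Parse a registration message; reject it with a reason rather than guess."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return ParseResult(False, reason=f"malformed XML: {exc}")

    # Hypothetical element names; the real contract is not shown in the article.
    employee_id = root.findtext("EmployeeId")
    rate_text = root.findtext("CommissionRate")
    if not employee_id or rate_text is None:
        return ParseResult(False, reason="missing required field")

    try:
        rate = float(rate_text)
    except ValueError:
        return ParseResult(False, reason=f"non-numeric rate: {rate_text!r}")
    if rate <= 0.0:
        return ParseResult(False, reason=f"invalid commission rate: {rate}")

    return ParseResult(True, data={"employee_id": employee_id, "rate": rate})
```

Because the reason travels with the result, the same parser can feed a log entry, an operator alert, or an error reply to the sender without being changed.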


Contract failures

The second category of message failure is a contract failure. These are failures that occur as a result of a change in the message contract -- that is, in the format and meaning of the fields in the message. These failures result from not coordinating message contract changes between the sending and receiving processes.

The common approach to handling contract failures is to coordinate the simultaneous update of both the sending and receiving processes whenever the message contract changes. However, this defeats the goal of process independence, because the processes are tightly coupled by the shared messages.

One way to address this problem is to add versioning information to messages. Now, each time the message contract changes -- if a field is added or removed or redefined, for instance -- the version identification of the message changes.

The architect must decide what form this versioning information will take. It could be a version number in the header tag. It could be a standalone tag. It could be a namespace. It could be a fixed field at the beginning of fixed-length messages. It could be something else; but there needs to be something.

For asynchronous messages, the message would contain the message's format version. Listing 4 shows a sample XML message with a version number in the header tag.


Listing 4. Asynchronous message versioning example
 

<RegisterSalesRep version="1"> 
    <!-- . . . rest of message . . . --> 
</RegisterSalesRep>

For synchronous messages, the request message would contain both the message's contract version and the desired contract version for the reply message. This is critical, because it is very likely that the requesting process will not support the most recent reply message contract, but rather only a previous version. Listing 5 shows a sample XML message with current and reply version information.


Listing 5. Synchronous request message versioning example
 

<GetSalesRepDetails requestVersion="3" replyVersion="2"> 
    <!-- . . . rest of message . . . --> 
</GetSalesRepDetails>

Handling message versions, however, adds complexity to the application. The server application will have to maintain several back-level version parsing and formatting components. Also, whenever the interface to the back-end component changes -- if new data elements are added, for instance -- then every back-level version parsing and formatting component requires an update as well. Figure 2 shows the increase in application components necessary to support message versioning.


Figure 2. Versioning complexity example
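In code, the cost of versioning typically shows up as one parser per supported contract version. The Python sketch below dispatches on the `version` attribute used in Listing 4. The parser functions, the `CommissionRate` and `CommissionPercent` elements, and the difference between the two versions are all hypothetical and serve only to show the shape of the dispatch.

```python
import xml.etree.ElementTree as ET


class UnsupportedVersion(Exception):
    pass


def parse_v1(root):
    # Hypothetical version 1 contract: rate stored as a fraction.
    return {"rate": float(root.findtext("CommissionRate"))}


def parse_v2(root):
    # Hypothetical version 2 contract: rate stored as a percentage.
    return {"rate": float(root.findtext("CommissionPercent")) / 100.0}


# One entry per supported contract version; back-level parsers stay registered.
PARSERS = {"1": parse_v1, "2": parse_v2}


def parse_versioned(xml_text):
    root = ET.fromstring(xml_text)
    version = root.get("version")
    parser = PARSERS.get(version)
    if parser is None:
        # An unknown version is rejected explicitly instead of being misread.
        raise UnsupportedVersion(f"no parser for version {version!r}")
    return parser(root)
```

Every parser produces the same internal structure, so a change to that structure means revisiting each registered parser.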

Failure handling

While it's easy to understand the potential failure points in a messaging-based application, dealing with those failure points is not so simple. For example:

  • It may be acceptable to lose some messages; for other, more critical messages, this may not be an option.
  • Some messages may be time-independent; others may be time-dependent.
  • Some messages may allow for instances to arrive in any order; others may depend upon proper sequencing.
  • Some messages may be generated by a background or batch process; others may be generated by an interactive, user-initiated process.
  • Some messages may tolerate duplicates; others may not permit duplicates.

It is important for an architect to have a toolkit of failure-handling techniques for message-oriented systems. In the following subsections, I offer a survey of some approaches.

Message identification

To avoid processing a message twice, a form of sequence numbering or message identification is needed. The sending process assigns each message a unique identifier at message generation time. If the message has to be retransmitted as a result of connection error recovery, then the message must retain its original message identifier. The receiving process must keep track of the identifiers of the messages it has processed. In this way, if the receiving process receives a message whose identifier matches that of a message it has already processed, it can reject the message as a duplicate.
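A minimal Python sketch of this technique follows. The `new_message` helper and the `DeduplicatingReceiver` class are hypothetical names, and the set of seen identifiers is held in memory only; a real receiver would presumably keep it in a durable store so that it survives a restart.

```python
import uuid


def new_message(payload):
    """Sender side: assign the identifier once, at generation time."""
    return {"id": str(uuid.uuid4()), "payload": payload}


class DeduplicatingReceiver:
    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # in-memory only; assume a durable store in practice

    def receive(self, message):
        """Return True if processed, False if rejected as a duplicate."""
        if message["id"] in self._seen:
            return False
        self._handler(message["payload"])
        # Record the identifier only after processing succeeds.
        self._seen.add(message["id"])
        return True
```

A retransmission reuses the message dictionary unchanged, so its identifier matches and the second delivery is rejected.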

Sequence identification

To ensure that messages are processed in a particular sequence, a form of message sequence identification must be added to the message. The receiving process can then hold messages until the desired sequence of messages has been completely received. At that point, it will process these held messages in the proper sequential order.
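The holding behavior can be sketched in Python as follows. This variant releases each message as soon as all of its predecessors have been delivered, rather than waiting for a complete batch; the `Resequencer` name and the assumption that sequence numbers are consecutive integers starting at 1 are illustrative only.

```python
class Resequencer:
    """Hold out-of-order messages and release them in sequence order."""

    def __init__(self, handler, first_seq=1):
        self._handler = handler
        self._next = first_seq  # next sequence number to deliver
        self._held = {}         # sequence number -> payload

    def receive(self, seq, payload):
        if seq < self._next or seq in self._held:
            return  # already delivered or already held: ignore the duplicate
        self._held[seq] = payload
        # Release every message that is now contiguous with the last delivery.
        while self._next in self._held:
            self._handler(self._held.pop(self._next))
            self._next += 1
```

A sketch like this holds messages indefinitely if a sequence number never arrives, so a production design would also need a policy for gaps.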

Asynchronous retries

For asynchronous messages, the sending process should implement a journal or other persistence mechanism to save messages, and then keep trying to send them until it receives an acknowledgement. If, after some number of attempts, the delivery mechanism still does not respond, then the process should notify the operator for manual intervention.
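The following Python sketch shows one possible shape for this, with a bounded number of attempts and an operator notification when they are exhausted. The function and parameter names are hypothetical, a plain list stands in for the journal, and `put_msg` is assumed to return `True` once the delivery mechanism has acknowledged the message. A real sender would normally also wait between attempts.

```python
class RetryError(Exception):
    pass


def send_with_retries(event, journal, put_msg, max_attempts=3, notify_operator=print):
    """Journal the event, retry the hand-off, and escalate if it never succeeds.

    journal: a list standing in for a durable store of unsent events.
    put_msg: callable that returns True once the delivery mechanism acknowledges.
    """
    journal.append(event)
    for attempt in range(1, max_attempts + 1):
        if put_msg(event):
            journal.remove(event)  # remove only after the acknowledgement
            return attempt
    # The event stays in the journal so it can be resent after intervention.
    notify_operator(f"giving up on {event!r} after {max_attempts} attempts")
    raise RetryError(event)
```

Leaving the event in the journal on failure is what allows an operator, or a later run, to resend it.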

Timeout

For synchronous messages, the sending process must implement a timeout mechanism. If a reply is not received within this timeout period, then the sending process reports an error. To assist, if the delivery mechanism supports message expiry, then the message expiry should be set to this timeout period.
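As a rough illustration, the Python sketch below uses two in-process `queue.Queue` objects as stand-ins for the outbound and reply queues of a delivery mechanism. The `request_reply` function, the `ReplyTimeout` exception, and the echo responder are hypothetical, and matching a reply to its request (for example, by a correlation identifier) is left out for brevity.

```python
import queue
import threading


class ReplyTimeout(Exception):
    pass


def request_reply(request, outbound, replies, timeout_seconds):
    """Send a request and wait a bounded time for the reply."""
    outbound.put(request)
    try:
        return replies.get(timeout=timeout_seconds)
    except queue.Empty:
        # No reply arrived within the timeout period: report an error.
        raise ReplyTimeout(f"no reply within {timeout_seconds}s") from None


def start_echo_responder(outbound, replies):
    """Stand-in for the receiving process: answers one request, then exits."""
    def run():
        replies.put("reply to " + outbound.get())
    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker
```

If no responder is running, `request_reply` raises `ReplyTimeout` after the timeout period instead of blocking forever.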

Persistence

Where possible, save messages in a persistence store until the message is successfully processed. Some delivery mechanism products, such as WebSphere MQ, implement a persistence store, allowing messages to remain on a queue until expressly deleted.

Logging

The receiving process might log each message received. A database table or other persistence mechanism can store the messages. This logging serves several purposes. As an audit trail, it verifies that the receiving process received the message. If you build an operator console, an operator can use this logging file to view, repair, and retransmit messages rejected by the receiving process.

Dead letter queue monitoring

If the delivery mechanism provides a dead letter queue, then it is vital that a monitoring process exist. But what should the process do with any messages that are routed to the dead letter queue? The process might forward the message to another queue for manual intervention, forward the message to the proper destination queue, or just ignore the messages.
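The first two options can be sketched in Python as below. A `deque` of `(destination, payload)` pairs stands in for the dead letter queue, and the function name and its callback parameters are hypothetical; a real monitor would read from the delivery mechanism's own dead letter queue through that product's API.

```python
from collections import deque


def drain_dead_letters(dead_letters, known_destinations, forward, hold_for_operator):
    """Route each dead-lettered message instead of silently ignoring it.

    dead_letters: deque of (destination, payload) pairs, standing in for the queue.
    """
    forwarded = held = 0
    while dead_letters:
        destination, payload = dead_letters.popleft()
        if destination in known_destinations:
            forward(destination, payload)  # destination is known: redeliver
            forwarded += 1
        else:
            hold_for_operator(destination, payload)  # needs manual intervention
            held += 1
    return forwarded, held
```

Returning the counts gives the monitoring process something to report each time it runs.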

Error reporting

In the case of synchronous messages, errors are sent as a response message to the requesting process. This allows the requestor to immediately deal with the error situation. But in the case of asynchronous messages, which are more batch-like in behavior, there is no process to receive error reports. The architect must decide how to report errors and message rejections for asynchronous messages. Here are some suggestions:

  • Put the information into the error log. Of course, a log can generally only be accessed by a programmer, and the programmer may not understand the business importance of the message.

  • Put the information into a special message journal database table. This allows for the creation of an operator console where rejected messages can be viewed, edited, and retransmitted. But this does add the overhead of a quickly growing journal table and requires the investment of time for building a special operator console.

  • Send an alert message to an operator monitoring utility. While this allows the operations team to learn about messages being rejected, it does not assist with the recovery of those messages.

  • Send an error report back to the sending process, thus deferring error handling to the sending process. This requires that the original message contain information about the sending process and the sending process's error handling queue so the reporting mechanism knows where to send the error report.
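The message journal option can be sketched with Python's built-in `sqlite3` module. The table name `rejected_messages`, its columns, and the helper functions are assumptions for illustration; an operator console would be built on top of queries such as `pending_rejections`.

```python
import sqlite3


def open_journal(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS rejected_messages ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " body TEXT NOT NULL,"
        " reason TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'rejected')"
    )
    return db


def record_rejection(db, body, reason):
    with db:  # commits on success, rolls back on error
        cur = db.execute(
            "INSERT INTO rejected_messages (body, reason) VALUES (?, ?)",
            (body, reason),
        )
    return cur.lastrowid


def pending_rejections(db):
    """Rows an operator still needs to view, repair, and retransmit."""
    rows = db.execute(
        "SELECT id, body, reason FROM rejected_messages"
        " WHERE status = 'rejected' ORDER BY id"
    )
    return rows.fetchall()


def mark_resent(db, message_id):
    with db:
        db.execute(
            "UPDATE rejected_messages SET status = 'resent' WHERE id = ?",
            (message_id,),
        )
```

Keeping a status column rather than deleting rows preserves the audit trail, at the cost of the table growth noted above.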

Error responsibility

Just as difficult as determining how and where to send error reports is determining who is responsible for monitoring and repairing message errors. Ideally, the end users or the business owners are the ones who correct the message errors. Enabling this requires the development of additional programs that provide user-friendly assistance, maintenance, and monitoring. Unfortunately, the application project plan rarely considers or funds these additional programs.

In reality, it seems application developers are the ones dealing with the message errors. Since the developers have the tools and knowledge of the message environment, they are able to diagnose the situation. However, the developers need the context information from the end-user or business owner to repair the situation. As a result, placing the responsibility for message monitoring and repair with developers or system operations is inefficient because they have to spend time diagnosing the problem, tracking down the end-user or business owner, and making the repair. This repair time is a distraction from the developer's regular job responsibilities.


Considerations

When creating a distributed application, it is important to keep the failure points in mind. The following are some questions the architect should ask:

  • What are the failure points along the path of message delivery from process A to process B?
  • What facilities are available in the selected delivery mechanism for handling failures or assisting with failure recovery?
  • What is the type of each message? Is it synchronous or asynchronous?
  • If a send failure occurs, how is that reported within the sending process?
  • If a transmit failure occurs, how is that reported, and to what agent?
  • If a receive failure occurs, how is that reported and to what agent?
  • Is there a way to edit and resend rejected messages?
  • Who is responsible for monitoring and handling a rejected or erroneous message?
  • Should messages be recorded in a persistence store for auditing or recovery handling?
  • If the application includes a dead letter queue, who monitors it? How is the queue monitored?
  • Are messages versioned? Are message adapters available for each supported version?

Summary

This article only scratches the surface of issues related to message-oriented application failure points. The application architect must consider all the failure points that might interfere with data moving from one process to another. A single solution does not exist for all messages. Asynchronous messages are handled differently than synchronous messages. The chosen technology for message delivery also influences which approaches are available. By having an adequate toolkit of approaches, an architect can pick the best approach for each situation and message.


