You've just launched a major software initiative that integrates fifteen distinct systems among five different organizations into a real-time, first-of-its-kind business system. Customers and partners love it. Thousands of transactions are rolling in! But you've discovered that things don't always work as planned. Ten percent of the transactions fail for seemingly random reasons. External systems that are out of your control have unpredictable bugs that cause problems in your system. A nasty intermittent problem causes expensive downtime. You've come to dread any call from your operations team for fear of hearing about yet another elusive software problem. When debugging these problems, you feel like a sonar operator searching for an impossibly silent submarine among all the ocean noise.
This story is familiar to me because I have often failed to account for run-time failure modes when designing my systems. Because I have had to live with my mistakes, and have learned from talented and creative professionals, I now follow a design and operations philosophy for monitoring and response that helps me avoid those earlier mistakes. This paper describes these practices and provides some real-world examples to illustrate their usefulness.
If designed correctly, a good monitoring and response system is both inexpensive and quick to implement, and will allow you to:
- Spot problems before they become crises
- Appropriately respond to operational problems
- Accurately prioritize software defects, within your system or in external systems
- Measure system health and track improvements over time
- Record key metrics and report them to management, clients, partners and suppliers
- Efficiently identify enhancements and improvements, and accurately estimate the scope and impact of these changes
This article is primarily for software architects and developers who participate in the creation and operation of complex distributed systems. The solutions presented here apply most readily to systems with the following characteristics:
- Responsibility is distributed across disparate functional organizations
- Services are distributed across multiple geographic locations
- Software and hardware components are heterogeneous
- Inter-process communications are both synchronous and asynchronous
- Requirements for availability and reliability are demanding
- Consequences of system failure are significant
The body of this paper addresses three main topics:
- Exceptions -- recording detailed information about problems
- Monitoring -- using a separate system to keep an eye on things
- Responding -- implementing automated processes for handling problems once they arise
Developers writing software in object-oriented languages like Java, C#, and C++ are familiar with exception handling techniques; but even if you are writing in a language without built-in compiler support for an Exception class, you've got to provide support for run-time errors. Run-time errors are unavoidable and are frequently caused by problems that are out of your control. For the purposes of this discussion, I'll define an exception as any problem that is automatically reported by your code. Some exceptions can be handled while others cannot. In distributed systems, often you can do very little with an exception other than to report that it happened. Exceptions are not bugs, though some exceptions are caused by bugs.
Common types of exceptions in a distributed system include:
- Network error
- Invalid message (missing or incorrect data, bad format, and others)
- Timeout
- Software exception (such as null pointer or div/0)
Common causes for exceptions are:
- Incorrect software (bug)
- Hardware / network failure
- Improper configuration
- Version mismatch
- Data corruption (due to software or hardware failure; see above)
- Overloaded system
The following examples illustrate how exceptions vary in their severity and in the response they call for. Some very severe exceptions, like the transport exception in Example 1, are easily diagnosed, while others (Example 2) may take days to debug. Less significant but more irritating intermittent problems, like Example 3, may be easily diagnosed, but only if you have the proper monitoring and response infrastructure already in place.
Example 1
A router failure causes two of your four partners to be unavailable.
Symptom: Transport exception for some of your messages.
Example 2
Your trading partner experienced some unplanned maintenance in their mainframe rating system. Their quoting service still returns a message, but the XML is garbled with invalid formatting, since they are passing the entire mainframe error message back in the XML.
Symptom: Invalid XML exception.
Example 3
You exchange messages with a trading partner. An error in mapping causes your transformation process to generate a null pointer error, but only with certain obscure combinations of data.
Symptom: Null pointer exception.
Exception logging has well-established design patterns (see Resources), but the following list provides a quick overview of some good practices for distributed systems:
- Know your audience. Provide enough information to help them discover the root cause of the problem. Be careful not to provide so much information that the responder is overwhelmed with trivial data.
- Support multiple levels of logging -- at a minimum, provide Error, Warn, Info, Debug.
- Provide a feature for changing logging settings at run-time. You shouldn't have to restart the system to change logging settings. This is especially important when debugging intermittent problems.
- Isolate the logging so that it can be configured at the module or even class level. You should be able to turn on Debug level logging for just one piece of your code. This allows you to zero in on problems without being bombarded by useless debugging information from other parts of the system.
- Protect your system performance by wrapping expensive logging calls in if statements that first check whether the relevant level is enabled. Good log messages involve string manipulation, which can be expensive and is wasted when the message is discarded.
- Provide real-time access to distributed logs. Since your system is running on multiple physical servers, you need to provide access to all logs. This could mean logging to a central repository or providing infrastructure to query multiple distributed log repositories.
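Several of the practices above (multiple levels, per-module configuration, run-time changes, and guarded log calls) can be sketched with Python's standard `logging` module. This is a minimal illustration rather than a recommended design: the logger names `orders.transform` and `orders.transport` and the helpers `describe`, `log_mapping`, and `set_level` are hypothetical and chosen only for this example.

```python
import logging

# Hypothetical module-level loggers; the dotted names form a hierarchy,
# so each module can be configured on its own.
logging.getLogger("orders").setLevel(logging.WARNING)
transform_log = logging.getLogger("orders.transform")
transport_log = logging.getLogger("orders.transport")


def describe(message: dict) -> str:
    """Stand-in for an expensive string-building step."""
    return ", ".join(f"{key}={value!r}" for key, value in sorted(message.items()))


def log_mapping(message: dict) -> bool:
    """Build and log the detailed message only if Debug is enabled.

    Returns True when the detail was logged, False when it was skipped.
    """
    if transform_log.isEnabledFor(logging.DEBUG):
        transform_log.debug("mapping input: %s", describe(message))
        return True
    return False


def set_level(logger_name: str, level_name: str) -> None:
    """Change one logger's level at run time, without a restart."""
    level = getattr(logging, level_name.upper())
    logging.getLogger(logger_name).setLevel(level)
```

In a running system, `set_level` would be called from whatever administrative hook you already have (an admin endpoint, a signal handler, or a watched configuration file). Calling `set_level("orders.transform", "debug")` turns on detailed output for the transformation code only, while `orders.transport` stays at Warn.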
1.2 Categorize exceptions wisely
If your system is at all novel, you probably won't get your logging categories right the first time. Once you are up and running, you will discover that exceptions you thought should be errors are actually warnings and vice versa. You will realize that your logging is too verbose in certain areas, resulting in a flood of exception messages from just one simple failure mode. Other sections of code will remain infuriatingly silent when things go wrong. Here are a few suggestions for categorizing exceptions:
- Enforce consistency among the entire development team -- code reviews are great for this and coding standards are even better.
- Build in a patch feature or run-time configuration mechanism so that you can re-categorize exceptions without making a complete new software release.
- Constantly search for errors to demote to warnings. Don't call it an error if no one cares or no action is needed.
- Constantly search for warnings to promote to errors. If the exception indicates a bug or a problem to be fixed, it is an error rather than a warning.
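One way to make re-categorization possible without a new release is to keep the exception-to-severity mapping in data rather than in code. The sketch below assumes a JSON override document and uses hypothetical exception codes (`TRANSPORT`, `INVALID_XML`, `ZIP_MISMATCH`); the class name `SeverityMap` is likewise an assumption made for illustration.

```python
import json
import logging

log = logging.getLogger("exceptions")

VALID_LEVELS = ("ERROR", "WARNING", "INFO", "DEBUG")

# Hypothetical exception codes with their initial categories.
DEFAULT_SEVERITY = {
    "TRANSPORT": "ERROR",
    "INVALID_XML": "ERROR",
    "ZIP_MISMATCH": "ERROR",
}


class SeverityMap:
    """Maps exception codes to log levels and accepts run-time overrides."""

    def __init__(self, defaults):
        self._levels = dict(defaults)

    def load_overrides(self, json_text):
        """Apply overrides such as '{"ZIP_MISMATCH": "WARNING"}'."""
        overrides = json.loads(json_text)
        # Validate everything first so a bad entry changes nothing.
        for code, level in overrides.items():
            if level not in VALID_LEVELS:
                raise ValueError(f"unknown level {level!r} for {code}")
        self._levels.update(overrides)

    def level_for(self, code):
        # Unknown exceptions are treated as errors until someone reviews them.
        return self._levels.get(code, "ERROR")

    def report(self, code, detail):
        """Log the exception at its current category and return that category."""
        level = self.level_for(code)
        log.log(getattr(logging, level), "%s: %s", code, detail)
        return level
```

Defaulting unknown codes to Error is a deliberate choice in this sketch: a new failure mode is noisy until someone decides it deserves to be demoted.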
The following examples help illustrate the value of good categorization.
Example 4
You have four separate trading partners, each providing similar services. Each partner service is a complicated construction of old and new business systems, exposed with varying degrees of success through XML interfaces. The reliability of each interconnection varies from a 3% to 15% failure rate. Each failure means lost revenue to you. Some of these failures are caused by your system, but the bulk of the errors happen inside their systems. As such, your duty is to inform your partners of the frequency of occurrence and work with them to resolve the issues.
The failures fall into the following general categories:
Unfortunately, error types 1 and 2 are constructed and sent to you by your partners. To make matters worse, your partners cannot provide you with a list of all possible error messages because they themselves don't know all the possibilities.
Symptom: Inconsistent exception categories.
Example 5
You exchange messages with Partner X, who updates their zip code lookup database only twice per year. You update your database every month. As a result, you have a 1% zip code mismatch rate. The Partner X project team has informed you that they aren't going to change their zip code update process.
Symptom: Zip code mismatch exception for some of your messages.
Cause: Out-of-sync data update processes.
Action: None -- as my mom used to say, "you'll just have to eat it and LIKE IT." You choose to leave the error in your weekly and monthly reports, but categorize it as "WONTFIX".
A distributed transactional system should be responsible for reporting exceptions so you can debug problems, but it should never be required to notify people when something goes wrong. Instead, that task should be delegated to a completely separate and redundant monitoring system (Figure 1). The monitor should be able to absorb massive amounts of data and decide when to raise the alarm. Sometimes the decision is easy. Other times, usually at 2:00 a.m., the decision is more difficult.
Figure 1. Separate monitoring system

You can monitor your system with off-the-shelf tools (see Resources), but you will usually need to do some of your own automation as well. Here are some common monitoring activities:
Basic monitors. Set your monitoring system to poll your hardware regularly for basic system health information. It is best to log these measurements so that you can do post mortem analysis. Weekly or daily reports can help you spot trends in order to solve infrastructure problems before they become crises.
- Network Infrastructure: DNS, ping, routers, switches, bandwidth utilization
- Web infrastructure: HTTP server, database server
- Server: CPU load, memory, I/O
Custom monitors. Write your own monitors that test key points of failure in your system. Some ideas to consider:
- Implement a ping-style message for all point-to-point connections that are historically unreliable.
- Aggregate failure and success rates for frequently used connections.
- Aggregate timing information for resource intensive operations (for example: average, max, min, and standard deviation).
- Implement periodic end-to-end test transactions if possible. Many organizations do not allow this, since it can distort production reports, but it's nice to have if you can get it.
- Produce daily and weekly aggregate exception reports. Categorize and prioritize your exceptions by frequency of occurrence and severity.
- Record and fix exceptions using your bug tracking system.
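The aggregation ideas in this list (failure and success rates per connection, plus average, max, min, and standard deviation of timings) can be sketched in a few lines. The class name `ConnectionStats` and the connection label used in the usage note are assumptions for illustration; a production monitor would also need to bound memory and persist its results.

```python
import statistics
from collections import defaultdict


class ConnectionStats:
    """Collects per-connection timings and outcomes for aggregate reports."""

    def __init__(self):
        self._durations = defaultdict(list)
        self._failures = defaultdict(int)

    def record(self, connection, seconds, ok):
        """Record one call: how long it took and whether it succeeded."""
        self._durations[connection].append(seconds)
        if not ok:
            self._failures[connection] += 1

    def summary(self, connection):
        """Return aggregate figures, or None if nothing was recorded."""
        durations = self._durations.get(connection, [])
        if not durations:
            return None
        count = len(durations)
        return {
            "count": count,
            "failure_rate": self._failures.get(connection, 0) / count,
            "min": min(durations),
            "max": max(durations),
            "mean": statistics.mean(durations),
            # Sample standard deviation needs at least two data points.
            "stdev": statistics.stdev(durations) if count > 1 else 0.0,
        }
```

A daily report could call `summary("partner-a-quote")` for each connection and sort the results by failure rate, which gives you the frequency-based prioritization described above.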
The following examples illustrate some of these points:
Example 6
One out of every five hundred transactions results in your system making a call out to a third-party service bureau for customer validation. Sometimes there is no message for several hours. This service has a history of being unreliable, but you've got to work with them. Your call center staff wants to know when this happens so they can use a manual (and more expensive) work-around.
Symptom: Service is unavailable.
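Because real traffic to a service like this is sparse, silence alone does not tell you whether the service is down. A ping-style monitor that probes on its own schedule is one possible approach. The sketch below assumes a caller-supplied `probe` function that returns a truthy value on success; the class name `HeartbeatMonitor`, the status strings, and the threshold of three consecutive failures are all hypothetical.

```python
class HeartbeatMonitor:
    """Tracks consecutive probe failures and reports state transitions."""

    def __init__(self, probe, failures_to_alarm=3):
        self._probe = probe
        self._failures_to_alarm = failures_to_alarm
        self._consecutive_failures = 0
        self._alarmed = False

    def check(self):
        """Run one probe and return 'ok', 'failing', 'alarm', 'down', or 'recovered'."""
        try:
            ok = bool(self._probe())
        except Exception:
            # A probe that raises (timeout, refused connection) counts as a failure.
            ok = False

        if ok:
            was_alarmed = self._alarmed
            self._consecutive_failures = 0
            self._alarmed = False
            return "recovered" if was_alarmed else "ok"

        self._consecutive_failures += 1
        if self._alarmed:
            return "down"  # already reported; avoid repeating the alarm
        if self._consecutive_failures >= self._failures_to_alarm:
            self._alarmed = True
            return "alarm"  # notify the call center once, on this transition
        return "failing"
```

Only the `alarm` and `recovered` results need to reach the call center. Requiring several consecutive failures keeps a single dropped probe from sending staff to the manual work-around unnecessarily.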
Example 7: Real-time Exception Aggregation
As in Example 4, you have four separate trading partners, each providing similar services. Because of the complexity of their service, each of these partners experiences soft failures in which the error rate jumps from the normal 3% to 15% range all the way to 50% or 75%.
This causes products that are normally hot sellers to become inexplicably unavailable. When this happens, you need to inform your partners immediately so that they can remedy the situation. You've found that they don't always notice the problem until you notify them.
Symptom: Soft failure.
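A soft failure of this kind can be detected by tracking the error rate over the most recent transactions for each partner. The following is a minimal sketch, assuming a fixed-size sliding window; the class name `ErrorRateWindow`, the 40% threshold, and the window sizes are illustrative assumptions that you would tune against each partner's normal 3% to 15% rate.

```python
from collections import deque


class ErrorRateWindow:
    """Sliding window over the most recent transaction outcomes for one partner."""

    def __init__(self, size=200, threshold=0.40, min_samples=50):
        self._outcomes = deque(maxlen=size)  # oldest outcomes fall off automatically
        self._threshold = threshold
        self._min_samples = min_samples

    def record(self, ok):
        """Record one transaction outcome (True for success)."""
        self._outcomes.append(bool(ok))

    def error_rate(self):
        """Fraction of failures in the current window (0.0 when empty)."""
        if not self._outcomes:
            return 0.0
        return self._outcomes.count(False) / len(self._outcomes)

    def in_soft_failure(self):
        """True when there is enough data and the error rate is at or above the threshold."""
        enough_data = len(self._outcomes) >= self._min_samples
        return enough_data and self.error_rate() >= self._threshold
```

The `min_samples` guard matters at low volume: two failures out of three transactions is a 67% error rate, but it is not yet evidence of a soft failure.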
Example 8: Timing Data Aggregation
You've successfully implemented a high volume system with thousands of transactions per minute. You begin to hear grumblings from your customers that the system is sometimes slow, but you don't have specific information about when and under what circumstances this occurs. You have a monitor that collects timing on the main Web page every five minutes, but it isn't showing anything out of the ordinary. Somebody complained to your boss's boss, so you've got to do something. You suspect the problem is caused by unexpected network traffic in an important subsystem.
Symptom: Rumors of slow performance.
Example 9
You've recently discovered that one of your partners has a periodic failure mode where messages time out for about one minute. This occurs approximately six times per hour, once every ten minutes. You must convey this information to your partner and convince them to fix it.
Symptom: Short burst of high failure rate every ten minutes.
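To convince a partner that a failure is periodic, it helps to show the timeouts folded onto the suspected cycle. The sketch below assumes you already have the timestamps of failed messages as `datetime` objects; the function name `failures_by_cycle_minute` is hypothetical, and the minute-of-day arithmetic is only valid for cycle lengths that divide evenly into a day, such as ten minutes.

```python
from collections import Counter
from datetime import datetime


def failures_by_cycle_minute(failure_times, cycle_minutes=10):
    """Count failures by their position within a repeating cycle.

    Returns a list of length cycle_minutes; index m holds the number of
    failures that occurred during minute m of each cycle.
    """
    counts = Counter(
        (t.hour * 60 + t.minute) % cycle_minutes for t in failure_times
    )
    return [counts.get(minute, 0) for minute in range(cycle_minutes)]
```

If nearly all failures land in a single bucket, the report makes the ten-minute pattern hard to dispute, and the bucket index tells the partner where in the cycle to look (a scheduled job, a cache refresh, and so on).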
An automated monitoring system is only marginally useful if you don't have an equally automated response process to go with it. Many organizations enforce strict boundaries between data center and software engineering. Similar boundaries may exist between product marketing, customer support, sales, and other areas. The software solution presented here requires very close cooperation and partnership between all of these organizations. In particular, the software development team has the necessary understanding of the system architecture, the data center team understands availability constraints, and the business team, support team, customers, and partners all understand relative priorities of various features and failures.
When building your response process, consider these important features:
- Document your process and get commitment from all stakeholders. In large systems, you may have to deal with a web of stakeholders that spans many organizations across many geographical and cultural boundaries.
- Create level of service goals for each system interface.
- Include contact information for all stakeholders.
- Document both physical and logical system architecture.
- Describe typical failure modes and the appropriate response for each.
- Create an automated notification and escalation scheme.
- Provide daily, weekly, and monthly reports on all measurements and alerts.
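An automated notification and escalation scheme can start as a simple table of who is notified after an alert has gone unacknowledged for a given time. The following is a sketch under assumptions: the roles and delays in `POLICY` are invented for illustration, and actually delivering the notification (pager, e-mail, phone) is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes the alert has gone unacknowledged
    contact: str        # who to notify at this step


# Hypothetical policy; real contacts come from your documented process.
POLICY = [
    EscalationStep(0, "on-call developer"),
    EscalationStep(15, "development lead"),
    EscalationStep(60, "operations manager"),
]


def contacts_to_notify(policy, minutes_unacknowledged, already_notified):
    """Return contacts that are now due and have not yet been notified."""
    due = sorted(policy, key=lambda step: step.after_minutes)
    return [
        step.contact
        for step in due
        if step.after_minutes <= minutes_unacknowledged
        and step.contact not in already_notified
    ]
```

A scheduler would call `contacts_to_notify` every minute or so for each open alert, send the notifications, and add the returned names to the alert's notified set until someone acknowledges it.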
Example 10: Kelly turned off the monitor
You've created 100 different monitors for your system. Your team is extremely disciplined in managing configuration control of your source code base, but that discipline has not been extended to your monitoring system. Late one night, Kelly turns off an important monitor and forgets to turn it back on again. You don't discover this until it is too late -- a crisis could have been easily averted had the monitor been active. Unfortunately, you can't even find out when Kelly turned off the monitor or whether he changed anything else that night.
Symptom: Monitors are incorrectly configured (or disabled).
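One way to guard against this situation is to route every monitor change through a small registry that keeps an audit trail and can report monitors that have been disabled for too long. This is a sketch, not a description of any particular monitoring product: the class name `MonitorRegistry`, the monitor names, and the eight-hour limit are assumptions.

```python
from datetime import datetime, timedelta, timezone


class MonitorRegistry:
    """Records who enabled or disabled each monitor, and when."""

    def __init__(self):
        self._state = {}     # monitor name -> (enabled, changed_at)
        self.audit_log = []  # append-only list of change records

    def set_enabled(self, name, enabled, user, when=None):
        """Enable or disable a monitor and record the change."""
        when = when or datetime.now(timezone.utc)
        self._state[name] = (enabled, when)
        self.audit_log.append(
            {"monitor": name, "enabled": enabled, "user": user, "at": when}
        )

    def stale_disabled(self, now, max_age=timedelta(hours=8)):
        """Names of monitors that have been disabled for longer than max_age."""
        return sorted(
            name
            for name, (enabled, changed_at) in self._state.items()
            if not enabled and now - changed_at > max_age
        )

    def changes_by(self, user):
        """All recorded changes made by one user, oldest first."""
        return [record for record in self.audit_log if record["user"] == user]
```

Running `stale_disabled` from a daily report (or from a monitor of the monitors) would have flagged the forgotten monitor the next morning, and `changes_by` answers the question of what else was changed that night.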
Example 11: Don't call us, we'll call you
As in Example 4, you have four separate trading partners, each providing similar services. Of the four, only Partner A tends to discover problems in their system before you do. Partner A has told you not to call if there is a problem; "we always know it before you do." Other partners prefer to be notified when your monitors detect a problem.
Symptom: Partners B, C, and D often have to be told when their service has a problem.
For Partner A, you may decide to implement a policy to call them if you haven't heard from them within one hour of your noticing a problem.
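A per-partner policy like this is easy to express as data. The sketch below encodes the one-hour grace period for Partner A and immediate notification for the others; the table name `NOTIFY_DELAY` and the function `should_call` are hypothetical names used for illustration.

```python
from datetime import timedelta

# How long to wait after detecting a problem before calling each partner.
# Partner A gets a one-hour grace period; the others are called immediately.
NOTIFY_DELAY = {
    "A": timedelta(hours=1),
    "B": timedelta(0),
    "C": timedelta(0),
    "D": timedelta(0),
}


def should_call(partner, time_since_detection, partner_has_contacted_us):
    """Decide whether it is time to call a partner about a detected problem."""
    if partner_has_contacted_us:
        return False  # they already know; no call needed
    return time_since_detection >= NOTIFY_DELAY[partner]
```

Keeping the delays in a table rather than in branching code means that a change in a partner's preference is a configuration change, not a code change.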
Example 12
You've built a powerful monitoring and response system for your company. The data center staff can keep the machines running and operate the basic system, but the newness of the software and the rate at which you are adding new features lead you to assign software developers to support the operations staff.
Action: Each week, one developer is assigned to hold the pager. Pager duties include:
All developers must rotate through pager duty, though extra duty may be assigned to developers who check in code that isn't properly tested (break the build), fail to write unit tests, or don't review their code with the team.
Advances in software development, network reliability, and system performance have made it much easier to create large distributed systems like the ones described in the examples. Along with this new-found level of integration comes an equally new type of operational complexity. Under these circumstances, a good monitoring and response process is a necessity: it will help you to achieve your level-of-service goals and continuously improve your software.
The examples presented here actually happened, though the solutions weren't always obvious at the time.
Thanks to Tim McCune, Billy Lyvers, and Paul Kilroy for teaching me about exceptions, and to Tom LaStrange and Bill Schneider for comments and criticism.
Resources
- Check out the Sun JDK with its full-featured logging facility (JDK 1.4.2 Logging) or see why the author likes the Jakarta project's Commons Logging, which includes implementations of the JDK logging as well as the more familiar Log4j.
- Try SiteScope, one of many off-the-shelf application monitoring packages.
- Use Bugzilla to automate your response process or try JIRA, a great commercial issue system, with a full-featured API.
- Read "Writing good exceptions" for a very good discussion of error handling style (developerWorks May 2003).
- Explore exception handling and logging for distributed EJB systems in "Best practices in EJB exception handling" (developerWorks January 2003).
- Get a description of the J2SE logging API as well as basic exception handling in "Logging and Exceptions" (developerWorks December 2001).
Frank San Miguel has built complex distributed systems for 20 years and is now the principal of San Miguel Technology, a software development and consulting firm. During the time of the first Web browsers, he was an enthusiastic evangelist and principal architect of Mapquest.com. More recently, he helped to create a Web portal that integrated eight insurance companies for online sales. Contact Frank at fsm@fsanmiguel.com.










