Monitoring and response for distributed systems

Managing the data from run-time failure

Frank San Miguel (fsm@fsanmiguel.com), Principal, San Miguel Technology, LLC
Frank San Miguel has built complex distributed systems for 20 years and is now the principal of San Miguel Technology, a software development and consulting firm. During the time of the first Web browsers, he was an enthusiastic evangelist and principal architect of Mapquest.com. More recently, he helped to create a Web portal that integrated eight insurance companies for online sales. Contact Frank at fsm@fsanmiguel.com.

Summary:  Building complex distributed systems is easier than ever, but many development and operations teams are not prepared for the volume of data that results from run-time failures. This paper describes a design and operations philosophy for monitoring and responding, which helps to manage these run-time failures. Real-world examples are used to illustrate the problems and solutions.

Date:  27 Apr 2004
Level:  Intermediate


Introduction

You've just launched a major software initiative that integrates fifteen distinct systems among five different organizations into a real-time, first-of-its-kind business system. Customers and partners love it. Thousands of transactions are rolling in! But you've discovered that things don't always work as planned. Ten percent of the transactions fail for seemingly random reasons. External systems that are out of your control have unpredictable bugs that cause problems in your system. A nasty intermittent problem causes expensive downtime. You've come to dread any call from your operations team for fear of hearing about yet another elusive software problem. When debugging these problems, you feel like a sonar operator searching for an impossibly silent submarine among all the ocean noise.

This story is familiar to me because I've often neglected to adequately take into account run-time failure modes when designing my systems. And because I've had the opportunity to live with my mistakes and to learn from talented and creative professionals, I now subscribe to a design and operations philosophy for monitoring and responding which helps me to avoid some of my earlier mistakes. This paper describes these practices and provides some real-world examples to illustrate their usefulness.

If designed correctly, a good monitoring and response system is both inexpensive and quick to implement, and will allow you to:

  • Spot problems before they become crises
  • Appropriately respond to operational problems
  • Accurately prioritize software defects, within your system or in external systems
  • Measure system health and track improvements over time
  • Record key metrics and report them to management, clients, partners and suppliers
  • Efficiently identify enhancements and improvements, and accurately estimate the scope and impact of these changes

This article is primarily for software architects and developers who participate in the creation and operation of complex distributed systems. The solutions presented here apply most readily to systems with the following characteristics:

  • Responsibility is distributed across disparate functional organizations
  • Services are distributed across multiple geographic locations
  • Software and hardware components are heterogeneous
  • Inter-process communications are both synchronous and asynchronous
  • Requirements for availability and reliability are demanding
  • Consequences of system failure are significant

The body of this paper addresses three main topics:

  • Exceptions -- recording detailed information about problems
  • Monitoring -- using a separate system to keep an eye on things
  • Responding -- implementing automated processes for handling problems once they arise

Exceptions

Developers writing software in object-oriented languages like Java, C#, and C++ are familiar with exception handling techniques; but even if you are writing in a language without built-in compiler support for an Exception class, you've got to provide support for run-time errors. Run-time errors are unavoidable and are frequently caused by problems that are out of your control. For the purposes of this discussion, I'll define an exception as any problem that is automatically reported by your code. Some exceptions can be handled while others cannot. In distributed systems, often you can do very little with an exception other than to report that it happened. Exceptions are not bugs, though some exceptions are caused by bugs.

Common types of exceptions in a distributed system include:

  • Network error
  • Invalid message (missing or incorrect data, bad format, and others)
  • Timeout
  • Software exception (such as null pointer or div/0)

Common causes for exceptions are:

  • Incorrect software (bug)
  • Hardware / network failure
  • Improper configuration
  • Version mismatch
  • Data corruption (due to software or hardware failure; see above)
  • Overloaded system

The following examples illustrate how exceptions vary in their severity, and the ways in which to respond. Some very severe exceptions, like the transport exception in Example 1, are easily diagnosed, while others (Example 2) may take days to debug. Less significant, but more irritating, intermittent problems like Example 3 may be easily diagnosed, but only if you have the proper monitoring and response infrastructure already in place.

Example 1. Network Failure

A router failure causes two of your four partners to be unavailable.

Symptom: Transport Exception for some of your messages.
Cause: Hardware Failure (network router).
Action: Notify someone right away!!

Example 2. Invalid XML

Your trading partner experienced some unplanned maintenance in their mainframe rating system. Their quoting service still returns a message, but the XML is garbled with invalid formatting, since they are passing the entire mainframe error message back in the XML.

Symptom: Invalid XML exception
Cause: Rating system outage.
Debugging: Send your data partner the entire request message, response message, the date/time and the conditions under which this error occurred.
Action: Convince your data partner to properly fix their XML by encapsulating the mainframe error in an XML CDATA section so you can record the real cause of the problem.

Example 3. Incorrect Mapping

You exchange messages with a trading partner. An error in mapping causes your transformation process to generate a null pointer error, but only with certain obscure combinations of data.

Symptom: Null pointer exception.
Cause: Incorrect mapping.
Action: You must log the entire content of the message. Since this only happens intermittently, you will have to decide whether to record all messages or create a run-time test that records messages only when exceptions occur. Fix the mapping, add a unit test to prevent this from happening in the future, cut a new release (or a patch release), install the fix and verify in production.

1.1 Log exceptions and errors

Exception logging has well-established design patterns (see Resources), but the following list provides a quick overview of some good practices for distributed systems:

  • Know your audience. Provide enough information to help them discover the root cause of the problem. Be careful not to provide so much information that the responder is overwhelmed with trivial data.
  • Support multiple levels of logging -- at a minimum, provide Error, Warn, Info, Debug.
  • Provide a feature for changing logging settings at run-time. You shouldn't have to restart the system to change logging settings. This is especially important when debugging intermittent problems.
  • Isolate the logging so that it can be configured at the module or even class level. You should be able to turn on Debug level logging for just one piece of your code. This allows you to zero in on problems without being bombarded by useless debugging information from other parts of the system.
  • Protect your system performance by wrapping your logging with if statements that check whether the level is enabled. Good log messages involve string manipulation, which can be expensive.
  • Provide real-time access to distributed logs. Since your system is running on multiple physical servers, you need to provide access to all logs. This could mean logging to a central repository or providing infrastructure to query multiple distributed log repositories.
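Several of the practices above (per-class loggers, run-time level changes, and guarded log statements) can be sketched with the JDK logging facility mentioned in Resources. This is a minimal illustration, not a complete logging design: the class name QuoteMapper and the logger name are hypothetical, and the JDK level FINE is assumed to play the role of Debug.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical class used only to illustrate the logging practices.
final class QuoteMapper {
    // One logger per class, so verbosity can be tuned for this class alone.
    private static final Logger LOG =
            Logger.getLogger("com.example.quotes.QuoteMapper");

    static String map(String partner, String message) {
        // Guard the call: the string is only built when FINE is enabled.
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("Mapping message from " + partner + ": " + message);
        }
        return message.trim();
    }

    // Change the level at run-time, without restarting the system.
    // Note: handlers have their own levels, which may also need adjusting
    // before FINE messages actually appear in the output.
    static void setDebug(boolean on) {
        LOG.setLevel(on ? Level.FINE : Level.INFO);
    }

    static boolean debugEnabled() {
        return LOG.isLoggable(Level.FINE);
    }
}
```

In a real system, setDebug would be called from whatever administrative interface you already have (a management console, a configuration reload, and so on), so that debugging can be turned on for one class while the rest of the system stays quiet.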

1.2 Categorize exceptions wisely

If your system is at all novel, you probably won't get your logging categories right the first time. Once you are up and running, you will discover that exceptions you thought should be errors are actually warnings and vice versa. You will realize that your logging is too verbose in certain areas, resulting in a flood of exception messages from just one simple failure mode. Other sections of code will remain infuriatingly silent when things go wrong. Here are a few suggestions for categorizing exceptions:

  • Enforce consistency among the entire development team -- code reviews are great for this and coding standards are even better.
  • Build in a patch feature or run-time configuration mechanism so that you can re-categorize without making a complete new software release.
  • Constantly search for errors to demote to warnings. Don't call it an error if no one cares or no action is needed.
  • Constantly search for warnings to promote to errors. If the exception indicates a bug or a problem to be fixed, it is an error rather than a warning.

The following examples help illustrate the value of good categorization.

Example 4. Error Normalizer

You have four separate trading partners, each providing similar services. Each partner service is a complicated construction of old and new business systems, exposed with varying degrees of success through XML interfaces. The reliability of each interconnection varies from a 3% to 15% failure rate. Each failure means lost revenue to you. Some of these failures are caused by your system, but the bulk of the errors happen inside their systems. As such, your duty is to inform your partners of the frequency of occurrence and work with them to resolve the issues.

The failures fall into the following general categories:

  1. Product unavailable
  2. Error within Partner system
  3. Error processing response message (mapping or transformation)
  4. Error creating request message (mapping or transformation)

Unfortunately, error types 1 and 2 are constructed and sent to you by your partners. To make matters worse, your partners cannot provide you with a list of all possible error messages because they themselves don't know all the possibilities.

Symptom: Inconsistent exception categories
Cause: Exceptions are passed to you by disparate organizations
Action: To appropriately handle and report on these errors, you must normalize the error messages by testing them for key phrases and substrings and then assigning them to the appropriate category. As you discover more new and interesting error conditions reported by your partners, you can add to the growing list of known species of errors. Any new errors introduced by your partners are flagged as uncategorized and evaluated for severity and appropriate response. Your weekly reports to management allow you to rank each partner in terms of the effectiveness of their service.
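The normalization step described in this action might look something like the following sketch. The class, the category names, and the example phrases are all hypothetical; a real normalizer would load its rules from configuration so that new phrases can be added without a release.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class ErrorNormalizer {
    // Illustrative categories for partner-generated errors.
    enum Category { PRODUCT_UNAVAILABLE, PARTNER_SYSTEM_ERROR, UNCATEGORIZED }

    // Insertion order is preserved: the first matching phrase wins.
    private final Map<String, Category> rules = new LinkedHashMap<>();

    // Register a key phrase (matched case-insensitively) for a category.
    void addRule(String phrase, Category category) {
        rules.put(phrase.toLowerCase(Locale.ROOT), category);
    }

    // Assign a raw partner message to a category by substring match.
    Category categorize(String partnerMessage) {
        String text = partnerMessage == null
                ? ""
                : partnerMessage.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, Category> rule : rules.entrySet()) {
            if (text.contains(rule.getKey())) {
                return rule.getValue();
            }
        }
        // Anything unknown is flagged for a human to evaluate.
        return Category.UNCATEGORIZED;
    }
}
```

Each message that comes back as UNCATEGORIZED is a candidate for a new rule, which is how the list of known error species grows over time.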

Example 5: WONT_FIX Errors

You exchange messages with Partner X who only updates their zip code lookup database twice per year. You update your database every month. As a result, you have a 1% zip code mismatch rate. The Partner X project team has informed you that they aren't going to change their zip code update process.

Symptom: Zip code mismatch exception for some of your messages.
Cause: Out-of-sync data update processes.
Action: None -- as my mom used to say, "you'll just have to eat it and LIKE IT." You choose to leave the error in your weekly and monthly reports, but categorize it as "WONT_FIX".

Monitoring

A distributed transactional system should be responsible for reporting exceptions so you can debug problems, but it should never be required to notify people when something goes wrong. Instead, that task should be delegated to a completely separate and redundant monitoring system (Figure 1). The monitor should be able to absorb massive amounts of data and decide when to raise the alarm. Sometimes the decision is easy. Other times, usually at 2:00 a.m., the decision is more difficult.

Figure 1. Separate monitoring system


You can monitor your system with off-the-shelf tools (see Resources), but you will usually need to do some of your own automation as well. Here are some common monitoring activities:

Basic monitors. Set your monitoring system to poll your hardware regularly for basic system health information. It is best to log these measurements so that you can do post mortem analysis. Weekly or daily reports can help you spot trends in order to solve infrastructure problems before they become crises.

  • Network Infrastructure: DNS, ping, routers, switches, bandwidth utilization
  • Web infrastructure: HTTP server, database server
  • Server: CPU load, memory, I/O

Custom monitors. Write your own monitors that test key points of failure in your system. Some ideas to consider:

  • Implement a ping-style message for all point-to-point connections that are historically unreliable.
  • Aggregate failure and success rates for frequently used connections.
  • Aggregate timing information for resource intensive operations (for example: average, max, min, and standard deviation).
  • Implement periodic end-to-end test transactions if possible. Many organizations do not allow this, since it can distort production reports, but it's nice to have if you can get it.
  • Produce daily and weekly aggregate exception reports. Categorize and prioritize your exceptions by frequency of occurrence and severity.
  • Record and fix exceptions using your bug tracking system.

The following examples illustrate some of these points:

Example 6: Service Ping

One out of every five hundred transactions results in your system making a call out to a third-party service bureau for customer validation. Sometimes there is no message for several hours. This service has a history of being unreliable, but you've got to work with them. Your call center staff wants to know when this happens so they can use a manual (and more expensive) work-around.

Symptom: Service is unavailable.
Cause: Unreliable partner.
Action: Implement a simple ping message for the service. Ping every five minutes, provide an alert mechanism for your call center and operations staff to help them prepare for manual work-arounds.
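A ping monitor of this kind can be quite small. The sketch below assumes the actual ping is supplied as a function that returns true when the service answers; the class name, the consecutive-failure rule, and the alert texts are illustrative assumptions rather than part of the original system.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;
import java.util.function.Consumer;

final class ServicePingMonitor {
    private final BooleanSupplier probe;    // true when the service answered
    private final int failuresBeforeAlert;  // consecutive failures tolerated
    private final Consumer<String> alert;   // e.g. notify call center and ops
    private int consecutiveFailures;
    private boolean down;

    ServicePingMonitor(BooleanSupplier probe, int failuresBeforeAlert,
                       Consumer<String> alert) {
        this.probe = probe;
        this.failuresBeforeAlert = failuresBeforeAlert;
        this.alert = alert;
    }

    // Run one ping and raise an alert only when the state changes.
    synchronized void pingOnce() {
        boolean ok;
        try {
            ok = probe.getAsBoolean();
        } catch (RuntimeException e) {
            ok = false; // a probe that throws counts as a failed ping
        }
        if (ok) {
            if (down) {
                alert.accept("Service is back up");
            }
            consecutiveFailures = 0;
            down = false;
        } else if (++consecutiveFailures >= failuresBeforeAlert && !down) {
            down = true;
            alert.accept("Service is unavailable");
        }
    }

    // Schedule the ping, for example every five minutes.
    void start(ScheduledExecutorService scheduler, long periodMinutes) {
        scheduler.scheduleAtFixedRate(this::pingOnce, 0, periodMinutes,
                TimeUnit.MINUTES);
    }
}
```

Alerting only on state changes keeps the call center from being paged every five minutes during a long outage, and the "back up" message tells them when the manual work-around is no longer needed.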

Example 7: Real-time Exception Aggregation

As in Example 4, you have four separate trading partners, each providing similar services. Because of the complexity of their services, each of these partners experiences soft failures in which the error rate jumps from the normal 3% to 15% all the way to 50% or 75%.

This causes products that are normally hot sellers to become inexplicably unavailable. When this happens, you need to inform your partners immediately so that they can remedy the situation. You've found that they don't always notice the problem until you notify them.

Symptom: Soft Failure.
Cause: Sub-system failure inside your partner's service.
Action: Build a Real-time Exception Aggregator. This is a custom monitor that summarizes the number of occurrences of exceptions within a given time window. Make it so you can configure thresholds, minimum and maximum sample size, time windows, and such. Like the knobs on the sonar console, you will need these adjustments to filter out the noise from the real problems. It's best to build this monitor in two parts: the aggregator and the monitor. The aggregator collects the data and provides summary reports based on the settings you choose. The monitor periodically requests reports and raises the alarm when the situation merits human intervention.
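One way to sketch the two-part design is shown below: an aggregator that keeps transaction outcomes for a sliding time window, and a monitor that asks it for a summary and decides whether to raise the alarm. The class names, the parameters, and the choice to pass timestamps in explicitly (which makes the behavior easy to test) are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Aggregator: keeps one outcome per transaction for a sliding time window.
final class ExceptionAggregator {
    private static final class Sample {
        final long timeMillis;
        final boolean failed;
        Sample(long timeMillis, boolean failed) {
            this.timeMillis = timeMillis;
            this.failed = failed;
        }
    }

    private final Deque<Sample> window = new ArrayDeque<>();
    private final long windowMillis;
    private int failures;

    ExceptionAggregator(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    synchronized void record(long nowMillis, boolean failed) {
        evict(nowMillis);
        window.addLast(new Sample(nowMillis, failed));
        if (failed) failures++;
    }

    synchronized int sampleSize(long nowMillis) {
        evict(nowMillis);
        return window.size();
    }

    synchronized double failureRate(long nowMillis) {
        evict(nowMillis);
        return window.isEmpty() ? 0.0 : (double) failures / window.size();
    }

    // Drop samples that have aged out of the window.
    private void evict(long nowMillis) {
        while (!window.isEmpty()
                && window.peekFirst().timeMillis <= nowMillis - windowMillis) {
            if (window.removeFirst().failed) failures--;
        }
    }
}

// Monitor: periodically asks the aggregator for a summary and decides.
final class FailureRateMonitor {
    private final ExceptionAggregator aggregator;
    private final double threshold; // e.g. 0.5 for a 50% failure rate
    private final int minSamples;   // avoid alarms on tiny sample sizes

    FailureRateMonitor(ExceptionAggregator aggregator, double threshold,
                       int minSamples) {
        this.aggregator = aggregator;
        this.threshold = threshold;
        this.minSamples = minSamples;
    }

    boolean shouldAlarm(long nowMillis) {
        return aggregator.sampleSize(nowMillis) >= minSamples
                && aggregator.failureRate(nowMillis) > threshold;
    }
}
```

The threshold, the minimum sample size, and the window length are the "knobs" here. A minimum sample size matters because one failure out of two transactions at 3:00 a.m. is a 50% failure rate but rarely a reason to wake anyone.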

Example 8: Timing Data Aggregation

You've successfully implemented a high volume system with thousands of transactions per minute. You begin to hear grumblings from your customers that the system is sometimes slow, but you don't have specific information about when and under what circumstances this occurs. You have a monitor that collects timing on the main Web page every five minutes, but it isn't showing anything out of the ordinary. Somebody complained to your boss's boss, so you've got to do something. You suspect the problem is caused by unexpected network traffic in an important subsystem.

Symptom: Rumors of slow performance.
Cause: Unexpected subsystem performance problem.
Action: Build a Real-time Timing Aggregator. This is a custom monitor that aggregates average, max, min, and standard deviation time measurements for a given time interval. Deploy this for three key timing metrics in your system. Like the custom monitor in Example 7, build it in two pieces. When the average wait time exceeds a certain threshold, your monitor will raise the alarm and you will get all hands to look at the system to find the source of the problem. Once the problem is isolated, you can keep the monitor active for another day (much like the way you make a permanent unit test for each bug you fix).
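The aggregation half of such a timing monitor could be sketched as follows. It keeps running statistics rather than storing every measurement, using Welford's method for the variance; the class name is hypothetical, and the standard deviation reported here is the population form (dividing by the number of samples), which is an assumption you may want to change.

```java
// Running statistics for one timing metric over one reporting interval.
final class TimingAggregator {
    private long count;
    private double mean;
    private double m2; // sum of squared differences from the running mean
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;

    // Record one measurement, in milliseconds.
    synchronized void record(double millis) {
        count++;
        double delta = millis - mean;
        mean += delta / count;
        m2 += delta * (millis - mean);
        min = Math.min(min, millis);
        max = Math.max(max, millis);
    }

    synchronized long count() { return count; }
    synchronized double average() { return mean; }
    synchronized double min() { return min; }
    synchronized double max() { return max; }

    // Population standard deviation; zero when there is no data.
    synchronized double standardDeviation() {
        return count == 0 ? 0.0 : Math.sqrt(m2 / count);
    }

    // Start a new reporting interval.
    synchronized void reset() {
        count = 0;
        mean = 0.0;
        m2 = 0.0;
        min = Double.POSITIVE_INFINITY;
        max = Double.NEGATIVE_INFINITY;
    }
}
```

The monitor half would read these values at the end of each interval, compare the average against its threshold, and then call reset. Watching max and standard deviation alongside the average helps catch the case in the example, where an every-five-minutes sample looks normal while some transactions are very slow.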

Example 9: Periodic Failures

You've recently discovered that one of your partners has a periodic failure mode in which messages time out for about one minute. This occurs once every ten minutes, or about six times per hour. You must convey this information to your partner and convince them to fix it.

Symptom: Short burst of high failure rate every ten minutes.
Cause: Partner's system is overloaded by batch processes every ten minutes.
Action: Produce hourly, daily, and weekly error reports for this partner. Summarize errors by category and time period, including percentage of total for all types. A special hourly report will highlight the fact that errors occur at ten-minute intervals. Now that you have this report, you can automatically generate it and distribute it to your software team each week (or each day). Rotate your entire team through the responsibility of reviewing the statistics. Use the report to continually re-prioritize all of your bugs and partner issues.
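The special hourly report can be as simple as counting errors by minute of the hour, so that a failure which recurs every ten minutes shows up as spikes at the same minutes in every hour. The following is a rough sketch with hypothetical names; it buckets by UTC minute and ignores error categories to keep the idea visible.

```java
import java.util.ArrayList;
import java.util.List;

// Counts errors by minute of the hour (UTC) to expose periodic failures.
final class MinuteOfHourHistogram {
    private final int[] counts = new int[60];

    // Record one error using its timestamp in epoch milliseconds.
    void recordError(long epochMillis) {
        int minute = (int) Math.floorMod(epochMillis / 60_000L, 60L);
        counts[minute]++;
    }

    int count(int minuteOfHour) {
        return counts[minuteOfHour];
    }

    // Minutes whose error count reaches the given threshold.
    List<Integer> minutesWithAtLeast(int threshold) {
        List<Integer> result = new ArrayList<>();
        for (int minute = 0; minute < counts.length; minute++) {
            if (counts[minute] >= threshold) {
                result.add(minute);
            }
        }
        return result;
    }
}
```

Fed with a day of time-out errors from the partner in this example, the histogram would show most of the errors clustered at six evenly spaced minutes, which is far more persuasive to the partner than a single daily failure percentage.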


Responding

An automated monitoring system is only marginally useful if you don't have an equally automated response process to go with it. Many organizations enforce strict boundaries between data center and software engineering. Similar boundaries may exist between product marketing, customer support, sales, and other areas. The software solution presented here requires very close cooperation and partnership between all of these organizations. In particular, the software development team has the necessary understanding of the system architecture, the data center team understands availability constraints, and the business team, support team, customers, and partners all understand relative priorities of various features and failures.

When building your response process, consider these important features:

  • Document your process and get commitment from all stakeholders. In large systems, you may have to deal with a web of stakeholders that spans many organizations across many geographical and cultural boundaries.
  • Create level of service goals for each system interface.
  • Include contact information for all stakeholders.
  • Document both physical and logical system architecture.
  • Describe typical failure modes and the appropriate response for each.
  • Create an automated notification and escalation scheme.
  • Provide daily, weekly, and monthly reports on all measurements and alerts.

Example 10: Kelly turned off the monitor

You've created 100 different monitors for your system. Your team is extremely disciplined in managing configuration control of your source code base, but that discipline has not been extended to your monitoring system. Late one night, Kelly turns off an important monitor and forgets to turn it back on again. You don't discover this until it is too late -- a crisis could have been easily averted had the monitor been active. Unfortunately, you can't even find out when Kelly turned off the monitor or whether he changed anything else that night.

Symptom: Monitors are incorrectly configured (or disabled).
Cause: Poor configuration control.
Action: Implement an automated configuration control system for your monitor and response process. Some important features:

  • Create a log of all modifications to monitor configurations (who, when, before, after)
  • Provide multiple levels of access control
  • Provide automatic notification when changes are made
  • Support rollback to previous versions
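The change log, notification, and rollback features in this list can be sketched in a few dozen lines. Everything named here (the class, the fields, the string-valued settings) is a hypothetical simplification, and access control is left out entirely; the point is only that every change, including a rollback, is recorded with who, when, before, and after.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

final class MonitorConfigStore {
    // One audit record: who changed which monitor, when, and how.
    static final class Change {
        final String who;
        final long whenMillis;
        final String monitor;
        final String before; // null if the monitor had no setting
        final String after;  // null if the setting was removed
        Change(String who, long whenMillis, String monitor,
               String before, String after) {
            this.who = who;
            this.whenMillis = whenMillis;
            this.monitor = monitor;
            this.before = before;
            this.after = after;
        }
    }

    private final Map<String, String> current = new HashMap<>();
    private final List<Change> history = new ArrayList<>();
    private final Consumer<Change> notifier; // e.g. e-mail the team

    MonitorConfigStore(Consumer<Change> notifier) {
        this.notifier = notifier;
    }

    synchronized String get(String monitor) {
        return current.get(monitor);
    }

    synchronized List<Change> history() {
        return new ArrayList<>(history);
    }

    // Apply a change, log it, and notify interested parties.
    synchronized void set(String who, long whenMillis, String monitor,
                          String value) {
        String before = (value == null)
                ? current.remove(monitor)
                : current.put(monitor, value);
        Change change = new Change(who, whenMillis, monitor, before, value);
        history.add(change);
        notifier.accept(change);
    }

    // Undo the most recent change to a monitor; the rollback is logged too.
    synchronized void rollback(String who, long whenMillis, String monitor) {
        for (int i = history.size() - 1; i >= 0; i--) {
            Change change = history.get(i);
            if (change.monitor.equals(monitor)) {
                set(who, whenMillis, monitor, change.before);
                return;
            }
        }
    }
}
```

With something like this in place, the question "when did Kelly turn off the monitor, and what else changed that night?" becomes a query against the history rather than a mystery.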

Example 11: Don't call us, we'll call you

As in Example 4, you have four separate trading partners, each providing similar services. Of the four, only Partner A tends to discover problems in their system before you do. Partner A has told you not to call if there is a problem; "we always know it before you do." Other partners prefer to be notified when your monitors detect a problem.

Symptom: Partners B, C, and D often have to be told when their service has a problem.
Cause: Partners have varying levels of internal operational control.
Action: Implement a customized problem resolution process for each partner. Agree upon key criteria such as:

  • Definitions of expected failure modes
  • Escalation process
  • Contact list
  • Hours of availability
  • Down-time schedule (if any)
  • Rules for notification

For Partner A, you may decide to implement a policy to call them if you haven't heard from them within one hour of your noticing a problem.
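Per-partner notification rules like these are easy to capture in configuration. As a small sketch, assuming each partner's rule reduces to a grace period before you call (one hour for Partner A, zero for partners who want to hear immediately), the decision could look like this. The class and method names are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

// One partner's rule for when to pick up the phone.
final class PartnerNotificationRule {
    // Zero means "call as soon as our monitors notice a problem".
    private final Duration gracePeriod;

    PartnerNotificationRule(Duration gracePeriod) {
        this.gracePeriod = gracePeriod;
    }

    // Call only if the partner has not contacted us and the grace
    // period since we detected the problem has fully elapsed.
    boolean shouldCall(Instant detectedAt, Instant now,
                       boolean partnerHasContactedUs) {
        if (partnerHasContactedUs) {
            return false;
        }
        return !now.isBefore(detectedAt.plus(gracePeriod));
    }
}
```

A real resolution process would also take hours of availability, scheduled down-time, and the escalation list into account; this sketch covers only the "rules for notification" item.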

Example 12: Pager Duty

You've built a powerful monitoring and response system for your company. The data center staff can keep the machines running and operate the basic system, but the newness of the software and the rate at which you are adding new features leads you to assign software developers to support the operations staff.

Action: Each week, one developer is assigned to hold the pager. Pager duties include:

  • 24/7 -- Be available to support data center during severe outages caused by software problems (this should be extremely infrequent).
  • Normal workday -- Respond within one hour to any software support requests from the operations staff. Allocate an average of one hour per day to supporting operations staff, including customer support and sys admin support for debugging high priority problems.
  • Daily -- Review all error and warning summary reports. File new bugs based on uncategorized errors, demote errors and promote warnings.

All developers must rotate through pager duty, though extra duty may be assigned to developers who check in code that isn't properly tested (break the build), fail to write unit tests, or don't review their code with the team.


In conclusion

Advances in software development, network reliability, and system performance have made it much easier to create large distributed systems like the ones described in the examples. Along with this new-found level of integration comes an equally new type of operational complexity. Under these circumstances, a good monitoring and response process is absolutely necessary and will help you to achieve your level of service goals and continuously improve your software.

The examples presented here actually happened, though the solutions weren't always obvious at the time.

Acknowledgements

Thanks to Tim McCune, Billy Lyvers, and Paul Kilroy for teaching me about exceptions, and to Tom LaStrange and Bill Schneider for comments and criticism.


Resources

  • Check out the Sun JDK with its full-featured logging facility (JDK 1.4.2 Logging) or see why the author likes the Jakarta project's Commons Logging, which provides adapters for the JDK logging as well as the more familiar Log4j.

  • Try Sitescope, one of many off-the-shelf application monitoring packages.

  • Use Bugzilla to automate your response process or try JIRA, a great commercial issue-tracking system, with a full-featured API.

  • Read "Writing good exceptions" for a very good discussion of error handling style (developerWorks May 2003).

  • Explore exception handling and logging for distributed EJB systems in "Best practices in EJB exception handling" (developerWorks January 2003).

  • Get a description of the J2SE logging API as well as basic exception handling in "Logging and Exceptions" (developerWorks December 2001).

