Using good results to do something even better
In an earlier article called Proactive Application Monitoring, I wrote about the benefits of application monitoring and outlined the best metrics to capture in a typical IBM® WebSphere® Application Server environment. The objective of that article was to enable you to detect an anomaly, jiggle a server or application, and ultimately prevent the problem before any user ever got a chance to notice. And that’s great. But can we do even better?
Is it possible for operations to reduce the mean time to repair (MTTR) and increase the mean time between failures (MTBF), all before users know anything has happened? If monitoring and operations processes have matured to the point that you can stop problems before they impact users and live operations, there must be more that you can do with this capability than just fix immediate problems.
All monitors work at some level of granularity, because most organizations would be hard pressed to find the storage for all the data, much less the horsepower to analyze and crunch all those numbers. Some monitors collect metrics like CPU and heap size every few seconds. Others, such as those that perform polling (like sending an HTTP request to pull a specific page), might only run every 15 minutes.
Naturally, there is a direct correlation between granularity and how quickly a problem is detected. If an HTTP request is sent only once every 15 minutes for Web analytics, then the best you can expect is to detect a problem up to 15 minutes after it occurs. For a site trying to proactively detect when an error has occurred, this might be too long an interval to wait; probability suggests that users will likely encounter the problem before the monitoring environment does. This leaves little time for troubleshooting and applying a fix before the problem begins to impact users. A finer granularity of monitoring might be needed in this case. However, running with finer granularity means more activity on the application, which in turn means more data that needs to be collected and stored. Additional storage, although not as expensive as it once was, still needs to be managed, maintained, powered, and backed up.
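To put rough numbers on this trade-off, here is a minimal sketch in Python. It assumes a problem is equally likely to start at any moment between two polls, so the average detection delay is half the polling interval and the worst case is the full interval. The function names and the bytes-per-sample figure are illustrative assumptions, not values from any particular monitoring product.

```python
from __future__ import annotations

SECONDS_PER_DAY = 86_400


def detection_delay_seconds(interval_s: float) -> tuple[float, float]:
    """Return (average, worst-case) delay before a poll notices a problem.

    Assumes the problem starts at a uniformly random point between polls.
    """
    return interval_s / 2, interval_s


def samples_per_day(interval_s: float) -> int:
    """Number of samples one monitor collects per day at this interval."""
    return int(SECONDS_PER_DAY // interval_s)


def daily_storage_bytes(interval_s: float, bytes_per_sample: int) -> int:
    """Storage one monitor needs per day (bytes_per_sample is an assumed size)."""
    return samples_per_day(interval_s) * bytes_per_sample


if __name__ == "__main__":
    # Compare a 15-minute page poll with a 5-second metric collector.
    for interval in (900, 5):
        avg, worst = detection_delay_seconds(interval)
        print(f"interval={interval}s avg_delay={avg}s worst_delay={worst}s "
              f"samples/day={samples_per_day(interval)}")
```

Under these assumptions, a 15-minute poll produces only 96 samples a day but notices a problem 7.5 minutes late on average, while a 5-second collector produces 17,280 samples a day per metric. Multiply that by the number of metrics and servers to see how quickly the storage requirement grows.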
If the granularity of the monitors is too large and problems are not being detected quickly enough, then you need to come up with a way to accommodate smaller time intervals between tests.
It's a cliche, but history does often repeat itself, and this premise holds true in most application environments. A problem that happens today could be one you will see again in six months or so. A common example I point to is a log disk that fills up (someone should be monitoring disk space, but that’s a topic for another article). When an application can no longer write to its logs, the entire application hangs. If several cluster members are pointed to the same shared log space, then the entire application environment deteriorates rapidly until the application stops. Operations can take a few minutes to troubleshoot, determine what the problem is, and then take steps to correct it.
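As an illustration of catching the log-disk example before it causes a hang, the following sketch checks how full a file system is using Python's standard `shutil.disk_usage`. The function names and the 80 and 90 percent thresholds are assumptions chosen for the example; pick values that give your operations team enough time to react.

```python
import shutil


def disk_usage_percent(path: str) -> float:
    """Return the percentage of the file system holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def log_disk_status(percent_used: float,
                    warn_at: float = 80.0,
                    critical_at: float = 90.0) -> str:
    """Classify disk usage against assumed warning and critical thresholds."""
    if percent_used >= critical_at:
        return "CRITICAL"
    if percent_used >= warn_at:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # "." stands in for the shared log directory; substitute the real path.
    percent = disk_usage_percent(".")
    print(f"log disk {percent:.1f}% used: {log_disk_status(percent)}")
```

A check like this is cheap enough to run at a fine granularity, and a WARNING result gives operations the chance to archive or rotate logs well before the cluster members sharing that log space begin to stall.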
Problems should be recorded in an operations runbook, with details that describe the problem, which components were involved, which people took part in determining the problem, and what corrective actions were taken. In many cases, it could be appropriate for this documentation to be reviewed by senior management so they can understand how prior business decisions (in this example, the decision to share resources) might have negatively affected the entire application environment, and determine whether any adjustments should be made to prevent future occurrences of the same problem. Of course, some problems are the result of not having enough IT staff to participate in troubleshooting and fixing actions.
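One possible way to keep such records searchable is to give each runbook entry a fixed structure. The sketch below is only an assumption about what that structure might look like; the `RunbookEntry` class, its field names, and the `find_by_component` helper are hypothetical and simply mirror the details listed above.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass
class RunbookEntry:
    """One recorded problem (hypothetical structure for illustration)."""
    occurred_on: date
    description: str
    components: list[str]          # components involved in the problem
    people: list[str]              # who helped determine the problem
    corrective_actions: list[str]  # what was done to fix it


def find_by_component(entries: list[RunbookEntry],
                      component: str) -> list[RunbookEntry]:
    """Return earlier entries that involved the given component."""
    return [entry for entry in entries if component in entry.components]


if __name__ == "__main__":
    # Illustrative entry based on the full log disk example; the date and
    # names are made up.
    runbook = [
        RunbookEntry(
            occurred_on=date(2010, 1, 15),
            description="Shared log disk full; cluster members hung",
            components=["shared log disk", "application cluster"],
            people=["operations on-call"],
            corrective_actions=["archived old logs", "restarted cluster members"],
        )
    ]
    for entry in find_by_component(runbook, "shared log disk"):
        print(entry.occurred_on, entry.corrective_actions)
```

When the same problem reappears six months later, a lookup by component returns the earlier diagnosis and fix, which is where the reduction in repair time comes from.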
Access to testing
Having a good test environment is one of the best ways to gain experience working with monitoring tools, granularity, and operational procedures. You get to see firsthand how the tools perform, what impact they have on the applications, and whether the desired granularity is adequate.
Health reporting: Business versus IT
There are at least two factions in any organization that are interested in the health of the processing environment: one side is interested in business metrics (how many orders were placed, how many claims were processed, whether service level agreements were met, and so on), the other in operational metrics (how many component failures occurred, how many applications were restarted, and so on). The two reports might not necessarily correlate with each other; for example, IT might be concerned because one cluster of Help system applications was down for a considerable part of the day, while the business side is less concerned because the Help system is not considered business-critical.
Operations needs to have a good understanding of business service level agreements and how they differ from operational ones, and therefore know how daily monitoring data should be presented to the different audiences that need this information.
IT operations are under increased pressure to do more with less. Going back to review operational readiness and understand where deficiencies exist will help an organization tune its monitoring activities to provide an even better and more reliable environment.
- Proactive Application Monitoring
- IBM's Coremetrics Web Analytics
- Wikipedia: Runbook
- IBM developerWorks WebSphere