Java run-time monitoring, Part 1

Run-time performance and availability monitoring for Java systems

Techniques and patterns


Many contemporary Java applications rely on a complex set of distributed dependencies and moving parts. Numerous external factors can have an impact on your application's performance and availability. These influences are virtually impossible to eliminate completely or account for and accurately emulate in a preproduction environment. Stuff happens. But you can significantly reduce the severity and duration of these events by creating and maintaining a comprehensive system that monitors your application's entire ecosystem.

This three-part article presents some patterns and techniques for implementing such a system. The patterns, and some of the terminology I'll use, are purposely generic. In concert with the code samples and illustrations, they'll help give you a conceptual understanding of application performance monitoring. This understanding emphasizes the need for a solution and in turn helps you select a commercial or open source solution, extend and customize one, or — for the motivated — serve as a blueprint for building one.

Part 1:

  • Explores attributes of application performance management (APM) systems
  • Describes common antipatterns for system monitoring
  • Presents methods for monitoring the performance of JVMs
  • Offers techniques for efficiently instrumenting application source code

Part 2 will focus on methods of instrumenting Java classes and resources without modification of the originating source code. Part 3 will address methods for monitoring resources outside the JVM, including hosts and their operating systems and remote services such as databases and messaging systems. It will conclude with a discussion of additional APM issues such as data management, data visualization, reporting, and alerting.

APM systems: Patterns and antipatterns

To start out on the right foot, I should emphasize that although much of the Java-specific content I present here may seem analogous to the process of application and code profiling, that's not what I'm referring to. Profiling is an extremely valuable preproduction process that can confirm or contraindicate that your Java code is scalable, efficient, fast, and generally wonderful. But based on the stuff happens axiom, a blue ribbon of approval from a development-phase code profiler will not serve you when you encounter inexplicable issues in production.

What I am referring to is implementing some of the aspects of profiling in production and collecting some of the same data in real time from your running application and all of its external dependencies. This data consists of a series of ongoing quantitative measurements that are pervasive in their targets to provide a granular and detailed representation of the whole system's health. And, by retaining a historical store of these measurements, you capture accurate baselines that can help you either confirm that the environment remains healthy or pinpoint the root cause and size of a specific shortfall.

Monitoring antipatterns

It is probably a rare application that has no monitoring resources at all, but consider these antipatterns, which are frequently present in operational environments:

  • Blind spots: Some system dependencies are not monitored, or monitoring data is inaccessible. An operational database can have full monitoring coverage, but if the supporting network does not, a fault in the network will effectively be hidden while a triage team pores over database-performance and application-server symptoms.
  • Black boxes: The core application or one of its dependencies may offer no monitoring transparency into its internals. To many monitoring tools, the JVM is effectively a black box. For example, a triage team investigating unexplained latency in a JVM, armed only with supporting operating-system statistics such as the process's CPU utilization and memory size, may be unable to diagnose a garbage-collection or thread-synchronization problem.
  • Disjointed and disconnected monitoring systems: An application may be hosted in a large shared data center where the dependencies comprise a number of shared resources, such as databases, storage-area network (SAN) storage, or messaging and middleware services. Organizations are sometimes highly siloed, with each group managing its own monitoring and APM systems (see the Pitfalls of siloed monitoring sidebar). Without a consolidated view of each dependency, each component owner sees only a small piece of the whole.

    Figure 1 contrasts siloed and consolidated APM systems:

    Figure 1. Siloed vs. consolidated APM systems
  • After-the-fact reporting and correlation: In an attempt to address issues with siloed monitoring, an operational support team may run periodic processes to acquire data from various sources, consolidate the data in one place after the fact, and then generate summary reports. This approach can be inefficient or impractical to execute on a regular frequency, and the lack of real-time consolidated data can have a negative impact on a triage team's ability to diagnose an issue on the spot. Furthermore, after-the-fact aggregation can lack sufficient granularity, resulting in the hiding of important patterns in the data. For example, a report may show that a particular service invocation had an average elapsed time of 200 milliseconds yesterday, while concealing the fact that between 1:00 p.m. and 1:45 p.m. it was regularly clocking in at over 3500 milliseconds.
  • Periodic or on-demand monitoring: Because some monitoring tools impose a high resource overhead, they cannot (or should not) be run constantly. Consequently, they collect data only rarely, or only after a problem has been detected. As a result, the APM system performs minimal baselining, can't alert you to an issue before its severity becomes intolerable, and may itself exacerbate the condition.
  • Nonpersisting monitoring: Many monitoring tools provide a useful live display of performance and availability metrics, but they are not configured for, or do not support, the capability to persist measurements for long- or short-term comparison and analysis. Frequently, in the absence of a historical context, performance metrics have little or no worth because there's no basis on which to judge if the value of the metrics is good, bad, or abysmal. For example, consider a current CPU utilization level of 45 percent. Without knowing what the utilization was during heavy or light periods of load in the past, such a measurement is much less informative than knowing that the typical value is x percent and the upper bound for acceptable user performance has historically been y percent.
  • Reliance on preproduction modeling: The practice of relying exclusively on preproduction monitoring and system modeling, with the assumption that all potential issues can be weeded out of the environment before production deployment, often leads to insufficient run-time monitoring. This assumption fails to account for unpredicted events and dependency failures, leaving triage teams with no tools or data to work with when such events occur.

The implementation of a consolidated APM does not preclude or devalue highly specific monitoring and diagnosis tools such as DBA administrative toolsets, low-level network-analysis applications, and data center management solutions. These tools remain invaluable resources, but if they're relied on to the exclusion of a consolidated view, the silo effect is difficult to overcome.

The ideal APM system's attributes

In contrast to the antipatterns I've just described, the ideal APM system this article series presents has the following attributes:

  • Pervasive: It monitors all application components and dependencies.
  • Granular: It can monitor extremely low-level functions.
  • Consolidated: All collected measurements are routed to the same logical APM supporting a consolidated view.
  • Constant: It monitors 24 hours a day, 7 days a week.
  • Efficient: The collection of performance data does not detrimentally influence the target of the monitoring.
  • Real-time: The monitored resource metrics can be visualized, reported, and alerted on in real time.
  • Historical: The monitored resource metrics are persisted to a data store so historical data can be visualized, compared, and reported.

Before I delve into this system's implementation details, though, it will help to understand some of the generic aspects of APM systems.

APM system concepts

All APM systems access performance data sources and include facilities for collecting and tracing. Note that these are generic terms of my own choosing that describe general categories. They are not specific to any particular APM system(s), which may use other terminology for the same concepts. I'll use these terms throughout the rest of this article, based on the following definitions.

Performance data source

A performance data source (PDS) is a source of performance or availability data that is useful as a measurement to reflect a component's relative health. For example, Java Management Extensions (JMX) services can typically provide a wealth of data about the health of a JVM. Most relational databases publish performance data through an SQL interface. Both of these PDSs are examples of what I refer to as direct sources; the source supplies the performance data directly. In contrast, inferential sources measure a deliberate or incidental action, and performance data is derived from it. For example, a test message can periodically be sent and then retrieved from a Java Message Service (JMS) server, and the round-trip time is then an inferential measurement of that service's performance.

Inferential sources (an instance of which is referred to as a synthetic transaction) can be extremely useful because they can effectively measure multiple components or tiered invocations by traveling the same pathways as real activity. Synthetic transactions also play a key role in monitoring continuity to confirm a system's health during periods of relative inactivity, when direct sources are insufficient.
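An inferential measurement can be sketched generically: run the action, time it, and treat the elapsed time as the metric. This illustrative class (the name and the Callable-based signature are my own, not part of the article's API) stands in for something like a JMS send-and-receive round trip:

```java
import java.util.concurrent.Callable;

// Sketch of an inferential (synthetic-transaction) measurement: execute a
// round-trip action and derive a performance metric from its elapsed time.
// The Callable here is a stand-in for, say, a JMS send-and-receive cycle.
public class SyntheticTransaction {

    // Runs the action and returns the elapsed time in milliseconds.
    public static long measure(Callable<?> action) throws Exception {
        long start = System.nanoTime();
        action.call();
        return (System.nanoTime() - start) / 1_000_000L;
    }

    public static void main(String[] args) throws Exception {
        // A fake 50 ms "round trip" standing in for a real service call.
        long elapsed = measure(() -> { Thread.sleep(50); return null; });
        System.out.println("Round trip: " + elapsed + " ms");
    }
}
```

The elapsed time would then be handed to a tracer rather than printed, but the timing pattern is the same.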

Collecting and collectors

Collecting is the process of acquiring performance or availability data from a PDS. In the case of a direct PDS, a collector typically implements some sort of API to access that data. To read statistics from a network router, a collector might use Simple Network Management Protocol (SNMP) or Telnet. In the case of an inferential PDS, the collector executes and measures the underlying action.

Tracing and tracers

Tracing is the process of delivering measurements from the collector to the core APM system. Many commercial and open source APM systems provide some sort of API for this purpose. For the examples in this article, I have implemented a generic Java tracer interface, which I'll review in more detail in the next section.

Most APM systems typically organize data submitted by tracers into some sort of categorized and hierarchical structure. Figure 2 illustrates the general flow of this data capture:

Figure 2. Collecting, tracing, and the APM system

Figure 2 also presents some of the commonly provided services in APM systems:

  • Live visualization: Graphs and charts that display selected metrics in near real time.
  • Reporting: Generated reports of metric activity. These typically include a collection of canned reports, custom reports, and the ability to export data for use elsewhere.
  • Historical store: A historical data store containing raw or summary metrics so that visualization and reporting can be viewed for a specific time frame.
  • Alerting: The capability to notify interested individuals or groups about a specific condition determined from the collected metrics. Typical alerting methods are e-mail and some sort of custom hook interface to allow operation teams to propagate events into an event-processing system.

The implementation and use of a common tracing API throughout an APM's target environment provides some consistency. And, for the purposes of customized collectors, it lets the developer focus on acquiring performance data without needing to worry about the tracing aspects. The next section introduces an APM tracing interface that addresses this topic.

ITracer: A tracer interface

The Java language serves well as an implementation language for collectors because of its:

  • Wide platform support. Java collector classes can run unmodified on most target platforms. This gives a monitoring architecture the flexibility to colocate collector processes locally with the PDS and not mandate remote collection.
  • Typically excellent performance (although it varies with available resources).
  • Robust concurrency and asynchronous execution support.
  • Support for a rich set of communication protocols.
  • Broad support from third-party APIs such as JDBC implementations, SNMP, and proprietary Java interfaces, which in turn support a diverse library of collectors.
  • Support from an active open source community that provides additional tools and interfaces for the language to access or derive data from a huge number of sources.

One caveat, however, is that your Java collectors must be able to integrate with the tracing API supplied by your target APM system. If your APM's tracing mechanism does not provide a Java interface, some of these patterns will still apply. But in cases in which the target PDS is exclusively Java-based (such as JMX) and your application platform is not, you'll need a bridging interface such as IKVM, a Java-to-.NET compiler (see Related topics).

The tracing APIs supplied by different APM products are all different, in the absence of an official standard. So I have abstracted the issue by implementing a generic tracing Java interface called org.runtimemonitoring.tracing.ITracer. The ITracer interface is a generic wrapper for proprietary tracing APIs. This technique protects the source base from changes in versions or API providers, and it also presents the opportunity to implement additional functionality not available in the wrapped API. Most of this article's remaining examples implement the ITracer interface and the general underlying concepts it supports.

Figure 3 is a UML class diagram of the org.runtimemonitoring.tracing.ITracer interface:

Figure 3. ITracer interface and factory class

Trace categories and names

The root premise of ITracer is to submit a measurement and an associated name to the central APM system. This activity is implemented by the trace methods, which vary in accordance with the nature of the submitted measurement. Each trace method accepts a String[] name parameter that contains the contextual components of a compound name, the structure of which is specific to the APM system. The compound name indicates to the APM system both the namespace of the submission and the actual metric name; so a compound name usually has at least a root category and a measurement description. The underlying ITracer implementation should know how to build the compound name from the passed String[]. Table 1 illustrates two examples of compound naming conventions:

Table 1. Example compound names
Name structure          | Compound name
------------------------|--------------
Simple slash-delimited  | Hosts/SalesDatabaseServer/CPU Utilization/CPU3
JMX MBean ObjectName    | com.myco.datacenter.apm:type=Hosts,service=SalesDatabaseServer,group=CPU Utilization,instance=CPU3
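To make the naming concrete, here is a minimal sketch of how a tracer implementation might assemble these two compound-name styles from the String[] segments passed to a trace method (the class and method names are illustrative, not part of the article's API):

```java
// Illustrative builders for the two compound-name styles in Table 1.
public class CompoundNames {

    // Simple slash-delimited form: Hosts/SalesDatabaseServer/...
    public static String slashName(String... segments) {
        return String.join("/", segments);
    }

    // JMX ObjectName form: a domain followed by comma-delimited key=value
    // pairs, as in com.myco.datacenter.apm:type=Hosts,service=...
    public static String jmxName(String domain, String[] keys, String[] values) {
        StringBuilder b = new StringBuilder(domain).append(':');
        for (int i = 0; i < keys.length; i++) {
            if (i > 0) b.append(',');
            b.append(keys[i]).append('=').append(values[i]);
        }
        return b.toString();
    }
}
```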

Listing 1 is an abbreviated example of tracing calls using this API:

Listing 1. Example tracing API calls
ITracer simpleTracer = TracerFactory.getInstance(sprops);
ITracer jmxTracer = TracerFactory.getInstance(jprops);
simpleTracer.trace(37, "Hosts", "SalesDatabaseServer",
   "CPU Utilization", "CPU3", "Current Utilization %");
jmxTracer.trace(37, "com.myco.datacenter.apm",
   "type=Hosts", "service=SalesDatabaseServer",
   "group=CPU Utilization", 
   "instance=CPU3", "Current Utilization %");

Tracer measurement data types

In this interface, a measurement can have one of the following types:

  • int
  • long
  • java.util.Date
  • String

APM system providers might support other data types for collected measurements.

Tracer types

For a given measurement data type (such as long), the value can be interpreted in different ways depending on the type support in the APM system. Also keep in mind that each APM implementation may use different terminology for essentially the same type; ITracer uses some generic naming.

The tracer types represented in ITracer are:

  • Interval averaged: The trace(long value, String[] name) and trace(int value, String[] name) methods issue traces for interval averaged values (see the Intervals sidebar). This means that each submission is factored into the current interval's aggregate values. Once a new interval starts, the aggregate value counters are reset to zero.
  • Sticky: The traceSticky(long value, String[] name) and traceSticky(int value, String[] name) methods issue traces for sticky values. This means that in contrast with interval averaged metrics, the aggregates retain their values across intervals. If I trace a value of 5 now, and then I do not trace again until tomorrow sometime, that metric stays perpetually at 5 until a new value is supplied.
  • Deltas: A delta trace passes in a number, but the actual value supplied to (or interpreted by) the APM system is the delta between this measurement and the preceding one. These are sometimes referred to as rate types, which reflects what they are good for. Consider a measurement of a transaction manager's total number of commits. This number always increases, and most likely its absolute value is not useful. The number's useful aspect is the rate at which it increases, so collecting the absolute number on a regular period and tracing the delta between readings reflects the rate of transaction commits. Delta traces come in interval averaged and sticky flavors, although few use cases are interval averaged. Delta traces must be able to distinguish measurements that are expected only to increment from measurements that both increment and decrement. Submitted measurements that are less than the prior value should either be ignored or cause a reset of the underlying delta.
  • Incident: This type is a simple nonaggregated metric that is the incrementing count of how many times a specific event occurred in an interval. Because neither the collector nor the tracer would be expected to know what the running total is at any given time, the base traceIncident(String[] name) call has no value, and a tick of one incident is implicit. In preference to calling that method several times in a loop when you want a tick of more than one, the traceIncident(int value, String[] name) method ticks the total up by value.
  • Smart: The smart tracer is a parameterized type that maps to one of the other types in the tracer. The measurement's value and the tracing type are passed in as Strings, and the available types are defined as constants in the interface. This is a convenience method for scenarios in which the collector has no idea what the data type or tracer type of the data being collected is, but can be directed simply to pass the collected value and a configured type name to the tracer.
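The semantics of the interval averaged, sticky, and delta types can be illustrated with a toy in-memory tracer (this is not the article's ITracer implementation; all names here are illustrative):

```java
// Toy illustration of tracer-type semantics: interval-averaged state resets
// when an interval rolls over, sticky state survives the rollover, and a
// delta trace reports the difference between successive readings (resetting
// to zero when a reading drops below the prior one).
public class TracerSemantics {
    private long intervalSum, intervalCount; // interval-averaged state
    private long sticky;                     // sticky state
    private Long lastDelta;                  // prior reading for delta traces

    public void trace(long v)       { intervalSum += v; intervalCount++; }
    public void traceSticky(long v) { sticky = v; }

    // Returns the derived delta that would be supplied to the APM system.
    public long traceDelta(long v) {
        long d = (lastDelta == null || v < lastDelta) ? 0 : v - lastDelta;
        lastDelta = v;
        return d;
    }

    public long intervalAvg() { return intervalCount == 0 ? 0 : intervalSum / intervalCount; }
    public long sticky()      { return sticky; }

    // Interval rollover: averaged counters reset, sticky values persist.
    public void rollInterval() { intervalSum = 0; intervalCount = 0; }
}
```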

The TracerFactory is a generalized factory class used to create a new ITracer instance based on the configuration properties passed or to reference a created ITracer from cache.
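A minimal sketch of that factory-with-cache pattern, assuming a simplified String key rather than the Properties-based configuration the article's TracerFactory uses:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Factory-with-cache sketch: one tracer instance per distinct configuration,
// created on first request and served from cache thereafter. The empty
// ITracer interface and the String key are simplifying assumptions.
public class TracerFactorySketch {

    interface ITracer { /* trace methods elided */ }

    private static final Map<String, ITracer> CACHE = new ConcurrentHashMap<>();

    // Returns the cached tracer for this name, creating it if absent.
    public static ITracer getInstance(String tracerName) {
        return CACHE.computeIfAbsent(tracerName, k -> new ITracer() {});
    }
}
```

computeIfAbsent makes the create-or-reference step atomic, so concurrent collectors asking for the same configuration always share one instance.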

Collector patterns

Collectors typically use one of three patterns, which influences the tracer type that should be used:

  • Polling: The collector is invoked on a regular frequency, and it retrieves and traces the current value of a metric or set of metrics from a PDS. For example, a collector might be invoked every minute to read a host's CPU utilization or read the total number of committed transactions from a transaction manager through a JMX interface. The premise of a polling pattern is a periodic sampling of a target metric. So on a polling event, the metric's value is supplied to the APM system, but for the duration of the intermediate periods, the value is assumed to be unchanged. Accordingly, polling collectors typically use sticky tracer types: the APM system reports the value as unchanged in between all polling events. Figure 4 illustrates this pattern:
    Figure 4. Polling collection pattern
  • Listening: This general data pattern is a form of the Observer pattern. The collector registers itself as a listener of events with the target PDS and receives a callback whenever the event of interest occurs. The possible traced values issued as a result of the callback depend on the content of the callback payload itself, but at the least the collector can trace an incident for every callback. Figure 5 illustrates this pattern:
    Figure 5. Listening collection pattern
  • Interception: In this pattern, the collector inserts itself as an interceptor between a target and its caller or callers. On each instance of activity that passes through the interceptor, it makes a measurement and traces it. In cases in which the interception pattern is request/response, the collector can measure the number of requests, the response time, and possibly some measurement of the payload of the request or response. For example, an HTTP proxy server that also serves as a collector can:
    • Count requests, optionally demarcating by HTTP request type (GET, POST, and so on) or Uniform Resource Identifier (URI).
    • Time the response of requests.
    • Measure the size of the request and response.
    Because you can assume that an intercepting collector "sees" every event, the tracer type implemented would usually be interval averaged. Accordingly, if an interval expires with no activity, the aggregate values for that interval will be zero regardless of the activity in the prior interval. Figure 6 illustrates this pattern:
    Figure 6. Intercepting collection pattern
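The polling pattern above can be sketched with a ScheduledExecutorService that samples a metric source on a fixed period and hands each reading to a sink, the stand-in here for a sticky tracer call (all names are illustrative):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.LongConsumer;
import java.util.function.LongSupplier;

// Polling collector sketch: periodic sampling of a target metric. Between
// polling events the APM system assumes the value is unchanged, which is
// why a sticky tracer type fits this pattern.
public class PollingCollector {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    // Samples 'source' every periodMillis and forwards each value to 'sink'.
    public void start(LongSupplier source, LongConsumer sink, long periodMillis) {
        scheduler.scheduleAtFixedRate(
                () -> sink.accept(source.getAsLong()),
                0, periodMillis, TimeUnit.MILLISECONDS);
    }

    public void stop() {
        scheduler.shutdownNow();
    }
}
```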

Now that I've outlined the performance data tracing API, its underlying data types, and the patterns of data collection, I'll present some specific use cases and examples that put the API to work.

Monitoring the JVM

The JVM itself is a sensible place to start implementing performance monitoring. I'll start with performance metrics common to all JVMs and then move on to some JVM-resident components typically seen in an enterprise application. With few exceptions, instances of Java applications are processes supported by an underlying operating system, so several aspects of JVM monitoring are best viewed from the hosting OS perspective, which I'll cover in Part 3.

Until the release of Java Platform, Standard Edition 5 (Java SE), internal and standardized JVM diagnostics that could be efficiently and reliably collected at run time were fairly limited. Now, several useful monitoring points are available through the java.lang.management interface, which is standard in all compliant Java SE 5 (and newer) JVM versions. Some implementations of these JVMs supply additional proprietary metrics, but the access patterns are more or less the same. I'll focus on the standard ones that you can access through the JVM's MXBeans (JMX MBeans deployed inside the VM that expose a management and monitoring interface; see Related topics):

  • ClassLoadingMXBean: Monitors the class loading system.
  • CompilationMXBean: Monitors the compilation system.
  • GarbageCollectorMXBean: Monitors the JVM's garbage collectors.
  • MemoryMXBean: Monitors the JVM's heap and nonheap memory spaces.
  • MemoryPoolMXBean: Monitors memory pools allocated by the JVM.
  • RuntimeMXBean: Monitors the runtime system. This MXBean offers few useful monitoring metrics, but it does provide the JVM's input arguments and the start time and up time, both of which can be useful as factors in other derived metrics.
  • ThreadMXBean: Monitors the threading system.
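As a quick illustration of what these MXBeans expose, the following snippet reads a few of the standard attributes in-process through ManagementFactory; these accessors are part of the standard java.lang.management API:

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.ThreadMXBean;

// In-process readings from the standard platform MXBeans.
public class MXBeanProbe {

    public static int liveThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        return threads.getThreadCount();
    }

    public static long totalStartedThreads() {
        return ManagementFactory.getThreadMXBean().getTotalStartedThreadCount();
    }

    public static long heapUsed() {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        return mem.getHeapMemoryUsage().getUsed();
    }

    public static long loadedClasses() {
        ClassLoadingMXBean cl = ManagementFactory.getClassLoadingMXBean();
        return cl.getTotalLoadedClassCount();
    }
}
```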

The premise of a JMX collector is that it acquires an MBeanServerConnection (an object that can read attributes from MBeans deployed in a JVM), reads the values of the target attributes, and traces them with the ITracer API. For this type of collection, a critical decision is where to deploy the collector. The choices are local deployment and remote deployment.

In local deployment, the collector and its invoking scheduler are deployed within the target JVM itself. The JMX collector component then accesses the MXBeans using PlatformMBeanServer, which is a statically accessible MBeanServerConnection inside the JVM. In remote deployment, the collector runs in a separate process and connects to the target JVM using some form of JMX Remoting. This may be less efficient than a local deployment but does not require the deployment of any additional components to the target system. JMX Remoting is beyond this article's scope, but it is easily achieved by deploying a RMIConnectorServer or by simply enabling external attaching in the JVM (see Related topics).
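A remote-deployment connection might be acquired along these lines, using the standard JMX remoting classes (the host, port, and /jmxrmi path are placeholders for whatever the target's connector server advertises):

```java
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Sketch of remote collection: connect to a target JVM's MBeanServer over
// RMI. Host, port, and the /jmxrmi JNDI path are placeholder values.
public class RemoteConnection {

    // Builds the conventional RMI-based JMX service URL for a target JVM.
    public static String serviceUrl(String host, int port) {
        return "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi";
    }

    public static MBeanServerConnection connect(String host, int port) throws Exception {
        JMXServiceURL url = new JMXServiceURL(serviceUrl(host, port));
        JMXConnector connector = JMXConnectorFactory.connect(url);
        return connector.getMBeanServerConnection();
    }
}
```

A production collector would keep a reference to the JMXConnector so it can be closed and reestablished on failure.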

Sample JMX collector

This article's sample JMX collector (see Download for the article's complete source code) contains three separate methods for acquiring an MBeanServerConnection. The collector can:

  • Acquire an MBeanServerConnection to the local JVM's platform MBeanServer using a call to the static ManagementFactory.getPlatformMBeanServer() method.
  • Acquire an MBeanServerConnection to a secondary MBeanServer deployed locally in the JVM's platform using a call to the static MBeanServerFactory.findMBeanServer(String agentId) method. Note that it is possible to have several MBeanServers resident in one JVM, and more complex systems such as Java Platform, Enterprise Edition (Java EE) servers nearly always have an application-server-specific MBeanServer that is separate and distinct from the platform MBeanServer (see the Cross-registering MBeans sidebar).
  • Acquire a remote MBeanServerConnection through standard RMI remoting using a JMXConnector.

Listing 2 is an abbreviated snippet from the JMXCollector collect() method showing the collection and tracing of thread activity from the ThreadMXBean.

Listing 2. Portion of sample JMX collector's collect() method using ThreadMXBean
// The ThreadMXBean ObjectName is cached once, at startup:
objectNameCache.put(THREAD_MXBEAN_NAME, new ObjectName(THREAD_MXBEAN_NAME));

public void collect() {
   CompositeData compositeData = null;
   String type = null;
   try {
      log("Starting JMX Collection");
      long start = System.currentTimeMillis();
      ObjectName on = null;
      // Thread Monitoring
      on = objectNameCache.get(THREAD_MXBEAN_NAME);
      tracer.traceStickyDelta((Long)jmxServer.getAttribute(on, "TotalStartedThreadCount"),
        hostName, "JMX", on.getKeyProperty("type"), "StartedThreadRate");
      tracer.traceSticky((Integer)jmxServer.getAttribute(on, "ThreadCount"), hostName, 
        "JMX", on.getKeyProperty("type"), "CurrentThreadCount");
      // Done
      long elapsed = System.currentTimeMillis()-start;
      tracer.trace(elapsed, hostName, "JMX", "JMX Collector", 
         "Collection", "Last Elapsed Time");
      tracer.trace(new Date(), hostName, "JMX", "JMX Collector", 
         "Collection", "Last Collection");         
      log("Completed JMX Collection in ", elapsed, " ms.");         
   } catch (Exception e) {
      log("Failed:" + e);
      tracer.traceIncident(hostName, "JMX", "JMX Collector", 
         "Collection", "Collection Errors");
   }
}

The code in Listing 2 traces the values for TotalStartedThreadCount and ThreadCount. Because this is a polling collector, both tracings use the sticky option. But because TotalStartedThreadCount is always an increasing number, the most interesting aspect is not the absolute number, but rather the rate at which threads are being created, so that tracer uses the sticky delta option.

Figure 7 shows the APM metric tree created by this collector:

Figure 7. JMX collector APM metric tree

The JMX collector has a few aspects not shown in Listing 2 (but which can be seen in the full source code), such as the scheduling registration, which creates a periodic callback to the collect() method every 10 seconds.

In Listing 2, different tracer types and data types are implemented depending on the data source. For example:

  • TotalLoadedClassCount and UnloadedClassCount are traced as sticky deltas because the values always rise, and the delta is probably more useful than the absolute value as a means of measuring class-loading activity.
  • ThreadCount is a variable quantity that can increment or decrement, so it is traced as a sticky.
  • Collection Errors is traced as an interval incident, incremented on any exception encountered while taking a collection.

In pursuit of efficiency, because the target MXBeans' JMX ObjectName won't change during the target JVM's lifetime, the collector caches the names using the ManagementFactory constant names.

With two types of MXBeans — GarbageCollector and MemoryPool— the exact ObjectNames might not be known up front, but you can supply a general pattern. In these cases, the first time you make a collection, you issue a query against the MBeanServerConnection and request a list of all the MBeans that match the supplied pattern. To avoid future queries during the target JVM's lifetime, the returned matching MBean ObjectNames are cached.
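For example, the garbage-collector MXBean names can be resolved once with a wildcard ObjectName query and then cached (a sketch using the standard java.lang:type=GarbageCollector domain):

```java
import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

// Pattern-query sketch: resolve the garbage collector MXBean ObjectNames
// once with a wildcard, then cache the result for all later collections.
public class GcNameResolver {

    public static Set<ObjectName> resolve(MBeanServerConnection conn) throws Exception {
        // The name=* wildcard matches every registered collector
        // (the set of collectors varies by JVM and GC configuration).
        ObjectName pattern = new ObjectName("java.lang:type=GarbageCollector,name=*");
        return conn.queryNames(pattern, null); // cache this set after the first call
    }
}
```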

In some cases, a collection's target MBean attribute might not be a flat numeric type. This is the case with the MemoryMXBean and MemoryPoolMXBean. In these cases, the attribute type is a CompositeData object that is interrogated for its keys and values. In the case of the JVM management interface, the MXBean standard adopts the model of JMX Open Types, in which all attributes are language-neutral types such as java.lang.Boolean and java.lang.Integer. Complex types such as MemoryUsage are decomposed into key/value pairs of the same simple types. The full list of simple types is enumerated in the static OpenType.ALLOWED_CLASSNAMES field. This model supports a level of type independence so that JMX clients have no dependency on nonstandard classes, and it can also support non-Java clients because of the relative simplicity of the underlying types. For more detail on JMX Open Types, see Related topics.
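A short sketch of that decomposition, reading the MemoryMXBean's HeapMemoryUsage attribute over JMX as CompositeData and extracting one of its simple-typed keys:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.openmbean.CompositeData;

// Over a JMX connection, the MemoryMXBean's HeapMemoryUsage attribute
// arrives as CompositeData whose keys (init, used, committed, max) are
// simple open types; extracting a value is a keyed lookup.
public class CompositeReader {

    public static long heapUsed(MBeanServerConnection conn) throws Exception {
        ObjectName memory = new ObjectName("java.lang:type=Memory");
        CompositeData usage =
                (CompositeData) conn.getAttribute(memory, "HeapMemoryUsage");
        return (Long) usage.get("used");
    }
}
```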

In cases in which a target MBean attribute is a nonstandard complex type, you need to ensure that the class defining that type is in your collector's classpath. And you must implement some custom code to render the useful data from the retrieved complex object.

In instances in which a single connection is acquired and retained for all collections, error detection and remediation is required to create a new connection in the event of a failure. Some collection APIs provide disconnect listeners that can prompt the collector to close, clean up, and create a new connection. To address scenarios in which a collector tries to connect to a PDS that has been taken down for maintenance or is inaccessible for some other reason, the collector should poll for reconnect on a friendly frequency. Tracking a connection's elapsed time can also be useful in order to degrade the frequency of collections if a slowdown is detected. This can reduce overhead on a target JVM that may be overly taxed for a period of time.

Two additional techniques not implemented in these examples can improve the JMX collector's efficiency and reduce the overhead of running it against the target JVM. The first technique applies in cases in which multiple attributes are being interrogated from one MBean. Rather than requesting one attribute at a time using getAttribute(ObjectName name, String attribute), it is possible to issue a request for multiple attributes in one call using getAttributes(ObjectName name, String[] attributes). The difference might be negligible in local collection but can reduce resource utilization significantly in remote collection by reducing the number of network calls. The second technique is to reduce the polling overhead of the JMX exposed memory pools further by implementing the listening collector pattern instead of a polling pattern. The MemoryPoolMXBean supports the ability to establish a usage threshold that, when exceeded, fires a notification to a listener, which in turn can trace the value. As the memory usage increases, the usage threshold can be increased accordingly. The downside of this approach is that without extremely small increments in the usage threshold, some granularity of data can be lost and patterns of memory usage below the threshold become invisible.
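The first technique looks like this in practice, batching three Threading attributes into one getAttributes() call (the attribute names are the standard ThreadMXBean ones):

```java
import javax.management.AttributeList;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

// Batched collection sketch: one getAttributes() call fetches several
// attributes of an MBean, replacing one network round trip per attribute
// when the connection is remote.
public class BatchedCollect {

    public static AttributeList threadStats(MBeanServerConnection conn) throws Exception {
        ObjectName threading = new ObjectName("java.lang:type=Threading");
        return conn.getAttributes(threading,
                new String[] {"ThreadCount", "TotalStartedThreadCount", "PeakThreadCount"});
    }
}
```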

A final unimplemented technique is to measure windows of elapsed time and the total elapsed garbage-collection time and implement some simple arithmetic to derive the percentage of elapsed time that the garbage collector is active. This is a useful metric because some garbage collection is (for the time being) an inevitable fact of life for most applications. Because some number of collections, each lasting some period of time, are to be expected, the percentage of elapsed time when garbage collections are running can put the JVM's memory health in a clearer context. A general rule of thumb (but highly variable by application) is that any more than 10 percent of any 15-minute period indicates a potential issue.
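The arithmetic could be implemented along these lines. `GcTimePercent` and its method names are my own illustration (not the article's ITracer API), built on the standard `GarbageCollectorMXBean` counters:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Sketch: derive the percentage of wall-clock time spent in garbage collection
// between two samples, by windowing the cumulative collection-time counters.
public class GcTimePercent {
   private long lastSampleTime = System.currentTimeMillis();
   private long lastGcTime = totalGcTime();

   // Sum of cumulative collection times across all collectors.
   static long totalGcTime() {
      long total = 0;
      for(GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
         long t = gc.getCollectionTime();   // -1 if undefined for this collector
         if(t > 0) total += t;
      }
      return total;
   }

   // The simple arithmetic, separated out so it is easy to test.
   static double percentInGc(long gcDeltaMs, long elapsedMs) {
      if(elapsedMs <= 0) return 0d;
      return (gcDeltaMs * 100d) / elapsedMs;
   }

   // Called once per collection interval; returns the % of the window spent in GC.
   public double sample() {
      long now = System.currentTimeMillis();
      long gcNow = totalGcTime();
      double pct = percentInGc(gcNow - lastGcTime, now - lastSampleTime);
      lastSampleTime = now;
      lastGcTime = gcNow;
      return pct;
   }
}
```

Sampling every 15 minutes and alerting when `sample()` exceeds 10 would implement the rule of thumb above.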

External configuration for collectors

The JMX collector I've outlined in this section is simplified to illustrate the collection process, but it's extremely limiting always to have hard-coded collections. Ideally, a collector implements the data-access how, and an externally supplied configuration supplies the what. Such a design makes collectors much more useful and reusable. For the highest level of reuse, an externally configured collector should support these configuration points:

  • A PDS connection-factory directive to provide the collector the interface to use to connect to the PDS and the configuration to use when connecting.
  • The frequency to collect on.
  • The frequency on which to attempt a reconnect.
  • The target MBean for collection, or a wildcard object name.
  • For each target, the tracing compound name or fragment the measurement should be traced to, and the data type that it should be traced as.

Listing 3 illustrates an external configuration for a JMX collector:

Listing 3. Example of external configuration for a JMX collector
<?xml version="1.0" encoding="UTF-8"?>
<!-- enclosing collector element elided -->
   <attribute name="ConnectionFactoryClassName">...</attribute>
   <attribute name="ConnectionFactoryProperties">...</attribute>
   <attribute name="NamePrefix">,JMX</attribute>
   <attribute name="PollFrequency">10000</attribute>
   <attribute name="TargetAttributes">
         <TargetAttribute objectName="java.lang:type=Threading" 
            attributeName="ThreadCount" Category="Threading" 
            metricName="ThreadCount" type="SINT"/>
         <TargetAttribute objectName="java.lang:type=Compilation" 
            attributeName="TotalCompilationTime" Category="Compilation" 
            metricName="TotalCompilationTime" type="SDINT"/>
   </attribute>

Note that the TargetAttribute elements contain an attribute called type, which represents a parameterized argument to a smart type tracer. The SINT type represents sticky int, and the SDINT type represents delta sticky int.
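The article does not show the smart tracer's internals, but the two types could plausibly behave as sketched below: a sticky trace keeps the last reading as-is, while a delta sticky trace reports the change since the previous reading (useful for monotonic counters such as TotalCompilationTime). The `SmartTracer` class and `lastTraced` accessor are illustrative:

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch of SINT ("sticky int") vs. SDINT ("delta sticky int") tracing semantics.
public class SmartTracer {
   private final ConcurrentHashMap<String, Long> lastValues = new ConcurrentHashMap<String, Long>();
   private final ConcurrentHashMap<String, Long> traced = new ConcurrentHashMap<String, Long>();

   // SINT: record the reading as-is.
   public void traceSticky(String metric, long value) {
      traced.put(metric, value);
   }

   // SDINT: record the change since the previous reading; the first reading
   // only establishes the baseline and traces nothing.
   public void traceDeltaSticky(String metric, long value) {
      Long prev = lastValues.put(metric, value);
      if(prev != null) traced.put(metric, value - prev);
   }

   // Illustrative accessor so the behavior is observable.
   public Long lastTraced(String metric) {
      return traced.get(metric);
   }
}
```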

Monitoring application resources through JMX

So far, I've examined monitoring only standard JVM resources through JMX. However, many application frameworks, such as Java EE, can expose important application-specific metrics through JMX, depending on the vendor. One classic example is DataSource utilization. A DataSource is a service that pools connections to an external resource (most commonly, a database), limiting the number of concurrent connections to protect the resource from misbehaving or stressed applications. Monitoring data sources is a critical piece of an overall monitoring plan. Thanks to JMX's abstraction layer, the process is similar to what you've already seen.

Here's a list of typical data source metrics taken from a JBoss 4.2 application server instance:

  • Available connection count: The number of connections that are currently available in the pool.
  • Connection count: The number of actual physical connections to the database from connections in the pool.
  • Maximum connections in use: The high-water mark of in-use connections in the pool.
  • In-use connection count: The number of connections currently in use.
  • Connections-created count: The total number of connections created for this pool.
  • Connections destroyed count: The total number of connections destroyed for this pool.

This time, the collector uses batch attribute retrieval and acquires all the attributes in one call. The only caveat here is the need to interrogate the returned data to switch on the different data and tracer types. DataSource metrics are also pretty flat without any activity, so to see some movement in the numbers, you need to generate some load. Listing 4 shows the DataSource collector's collect() method:

Listing 4. The DataSource collector
public void collect() {
   try {
      log("Starting DataSource Collection");
      long start = System.currentTimeMillis();
      ObjectName on = objectNameCache.get("DS_OBJ_NAME");
      // Batch retrieval: one call for all attributes. The attribute names
      // correspond to the data source metrics listed above.
      AttributeList attributes = jmxServer.getAttributes(on, new String[]{
         "AvailableConnectionCount", "ConnectionCount", "MaxConnectionsInUseCount",
         "InUseConnectionCount", "ConnectionCreatedCount", "ConnectionDestroyedCount"});
      for(Attribute attribute: (List<Attribute>)attributes) {
         if(attribute.getName().equals("ConnectionCreatedCount") 
            || attribute.getName().equals("ConnectionDestroyedCount")) {
            tracer.traceDeltaSticky((Integer)attribute.getValue(), hostName, 
               "DataSource", on.getKeyProperty("name"), attribute.getName());
         } else {
            if(attribute.getValue() instanceof Long) {
               tracer.traceSticky((Long)attribute.getValue(), hostName, "DataSource", 
                  on.getKeyProperty("name"), attribute.getName());
            } else {
               tracer.traceSticky((Integer)attribute.getValue(), hostName, 
                  "DataSource", on.getKeyProperty("name"), attribute.getName());
            }
         }
      }
      // Done
      long elapsed = System.currentTimeMillis()-start;
      tracer.trace(elapsed, hostName, "DataSource", "DataSource Collector", 
         "Collection", "Last Elapsed Time");
      tracer.trace(new Date(), hostName, "DataSource", "DataSource Collector", 
         "Collection", "Last Collection");         
      log("Completed DataSource Collection in ", elapsed, " ms.");         
   } catch (Exception e) {
      log("Failed:" + e);
      tracer.traceIncident(hostName, "DataSource", "DataSource Collector", 
         "Collection", "Collection Errors");
   }
}

Figure 8 shows the corresponding metric tree for the DataSource collector:

Figure 8. The DataSource collector metric tree
The DataSource collector metric tree

Monitoring components in the JVM

This section addresses techniques that can be used to monitor application components, services, classes, and methods. The primary areas of interest are:

  • Invocation rate: The rate at which a service or method is being invoked.
  • Invocation response rate: The rate at which a service or method responds.
  • Invocation error rate: The rate at which a service or method generates errors.
  • Invocation elapsed time: The average, minimum, and maximum elapsed time for an invocation per interval.
  • Invocation concurrency: The number of threads of execution concurrently invoking a service or method.

Using metrics made available by some implementations of the Java SE 5 (and newer) ThreadMXBean, it is also possible to collect the following metrics:

  • System and user CPU time: The elapsed CPU time consumed invoking a method.
  • Number of waits and total wait time: The number of instances and total elapsed time when the thread was waiting while invoking a method or service. Waits occur when a thread enters a wait state of WAITING or TIMED_WAITING pending another thread's activity.
  • Number of blocks and total block time: The number of instances and total elapsed time when the thread was in a BLOCKED state while invoking a method or service. Blocks occur when a thread is waiting for a monitor lock to enter or reenter a synchronized block.
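Capturing those deltas around an invocation can be sketched as follows. The `ThreadMetricCapture` class and its `InvocationStats` holder are my own illustration; the `ThreadMXBean` and `ThreadInfo` calls are standard platform API. Note that thread contention monitoring is often disabled by default and must be enabled explicitly:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Sketch: measure per-invocation CPU time, waits, and blocks on the current thread.
public class ThreadMetricCapture {
   private static final ThreadMXBean tmx = ManagementFactory.getThreadMXBean();

   public static class InvocationStats {
      public long cpuTimeNanos, userTimeNanos;
      public long waitCount, waitTimeMs, blockCount, blockTimeMs;
   }

   // Runs the task on the current thread and returns the metric deltas.
   public static InvocationStats measure(Runnable task) {
      long id = Thread.currentThread().getId();
      if(tmx.isThreadContentionMonitoringSupported()) {
         tmx.setThreadContentionMonitoringEnabled(true);   // often off by default
      }
      if(tmx.isThreadCpuTimeSupported() && !tmx.isThreadCpuTimeEnabled()) {
         tmx.setThreadCpuTimeEnabled(true);
      }
      ThreadInfo before = tmx.getThreadInfo(id);
      long cpu0 = tmx.getCurrentThreadCpuTime();
      long user0 = tmx.getCurrentThreadUserTime();
      task.run();
      ThreadInfo after = tmx.getThreadInfo(id);
      InvocationStats s = new InvocationStats();
      s.cpuTimeNanos = tmx.getCurrentThreadCpuTime() - cpu0;
      s.userTimeNanos = tmx.getCurrentThreadUserTime() - user0;
      s.waitCount = after.getWaitedCount() - before.getWaitedCount();
      s.waitTimeMs = after.getWaitedTime() - before.getWaitedTime();
      s.blockCount = after.getBlockedCount() - before.getBlockedCount();
      s.blockTimeMs = after.getBlockedTime() - before.getBlockedTime();
      return s;
   }
}
```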

These metrics, and others, can also be determined using alternative tool sets and native interfaces, but this usually involves some level of overhead that makes them undesirable for production run-time monitoring. Having said that, the metrics themselves, even when collected, are low level. They may not be useful for anything other than trending, and they are quite difficult to correlate with any causal effects that can't be identified through other means.

All of the above metrics can be collected by a process of instrumenting the classes and methods of interest to make them collect and trace the performance data to the target APM system. A number of techniques can be used to instrument Java classes directly or to derive performance metrics from them indirectly:

  • Source code instrumentation: The most basic technique is to add instrumentation at the source code phase so that the compiled and deployed classes already contain the instrumentation at run time. In some cases, it makes sense to do this, and certain practices make it a tolerable process and investment.
  • Interception: By diverting an invocation through an interceptor that performs the measurement and tracing, it is possible to monitor accurately and efficiently without touching the targeted classes, their source code, or their run-time bytecode. This practice is quite accessible because many Java EE frameworks and other popular Java frameworks:
    • Favor abstraction through configuration.
    • Enable class injection and referencing through interfaces.
    • In some cases directly support the concept of an interception stack. The flow of execution passes through a configuration-defined stack of objects whose purpose and design is to accept an invocation, do something with it, and then pass it on.
  • Bytecode instrumentation: This is the process of injecting bytecode into the application classes. The injected bytecode adds performance-data-collecting instrumentation that is invoked as part and parcel of what is essentially a new class. This process can be highly efficient because the instrumentation is fully compiled bytecode, and the code's execution path is extended in about as small a way as possible while still collecting data. It also has the virtue of not requiring any modification to the original source code, and potentially minimal configuration change to the environment. Moreover, the general pattern and techniques of bytecode injection allow the instrumentation of classes and libraries for which source code is not available, as is the case with many third-party classes.
  • Class wrapping: This is the process of wrapping or replacing a target class with another class that implements the same functionality but also contains instrumentation.

Here in Part 1, I address only source code based instrumentation; you'll read more about interception, bytecode instrumentation, and class wrapping in Part 2. (Interception, bytecode instrumentation, and class wrapping are virtually identical from a topological perspective, but the action to achieve the result has slightly different implications in each case.)

Asynchronous instrumentation

Asynchronous instrumentation is a fundamental issue in class instrumentation. A previous section explored the concepts of polling for performance data. If polling is done reasonably well, it should have no impact on the core application performance or overhead. In contrast, instrumenting the application code itself directly modifies and affects the core code's execution. The primary goal of any sort of instrumentation must be "Above all, do no harm": the overhead penalty must be as close to negligible as possible. The measurement itself incurs a small execution penalty that cannot be eliminated entirely, but once the performance data has been acquired, it is critical that the remainder of the trace process be asynchronous. There are several patterns for implementing asynchronous tracing. Figure 9 illustrates a general overview of how it can be done:

Figure 9. Asynchronous tracing
Asynchronous tracing

Figure 9 illustrates a simple instrumentation interceptor that measures the elapsed time of an invocation by capturing its start time and end time, and then dispatches the measurement (the elapsed time and the metric compound name) to a processing queue. The queue is then read by a thread pool, which acquires the measurement and completes the trace process.
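The pattern in Figure 9 can be sketched in a few dozen lines. The `AsyncTracer` class, its `Trace` holder, and the `deliver()` sink are illustrative stand-ins for a real APM dispatch; the essential point is that the application thread only enqueues and never blocks:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

// Sketch: the instrumented thread enqueues the measurement; a worker pool
// drains the queue and completes the trace asynchronously.
public class AsyncTracer {
   public static class Trace {
      final String metricName; final long value;
      Trace(String metricName, long value) { this.metricName = metricName; this.value = value; }
   }

   private final BlockingQueue<Trace> queue = new ArrayBlockingQueue<Trace>(10000);
   private final ExecutorService workers = Executors.newFixedThreadPool(2);
   public final AtomicLong delivered = new AtomicLong();   // visible for testing

   public AsyncTracer() {
      for(int i = 0; i < 2; i++) {
         workers.execute(() -> {
            try {
               while(true) deliver(queue.take());
            } catch(InterruptedException ie) {
               Thread.currentThread().interrupt();   // shutdown signal
            }
         });
      }
   }

   // Called on the application thread: offer() never blocks, so on overflow
   // the trace is dropped rather than slowing the instrumented code.
   public void trace(String metricName, long elapsedMs) {
      queue.offer(new Trace(metricName, elapsedMs));
   }

   // Stand-in for dispatch to the APM system.
   protected void deliver(Trace t) {
      delivered.incrementAndGet();
   }

   public void shutdown() {
      workers.shutdownNow();
   }
}
```

Dropping on overflow is a deliberate design choice: losing a trace under extreme load is preferable to making the application wait.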

Java class instrumentation through source code

This section addresses the subject of implementing source level instrumentation and provides some best practices and example source code. It also introduces some new tracing constructs that I'll detail in the context of source code instrumentation to clarify their actions and their implementation patterns.

Despite the prevalence of alternatives, instrumentation of source code is unavoidable in some instances; in some cases it's the only solution. With sensible precautions, it's not necessarily a bad one. Considerations include:

  • If the option to instrument source code is available, and you're prohibited from implementing configuration changes to effect instrumentation more orthogonally, the use of a configurable and flexible tracing API is critical.
  • An abstracted tracing API is analogous to a logging API such as log4j, with these attributes in common:
    • Runtime verbosity control: The verbosity level of log4j loggers and appenders can be configured at start time and then modified at run time. Similarly, a tracing API should be able to control which metric names are enabled for tracing based on a hierarchical naming pattern.
    • Output endpoint configuration: log4j issues logging statements through loggers, which in turn are dispatched to appenders that can be configured to send the log stream to a variety of outputs such as files, sockets, and e-mail. The tracing API does not require this level of output diversity, but the ability to abstract a proprietary or APM system-specific library protects the source code from change through external configuration.
  • In some circumstances, it might not be possible to trace a specific item through any other means. This is typical in cases I refer to as contextual tracing. I use this term to describe performance data that is not of primary importance but adds context to the primary data.

Contextual tracing

Contextual tracing is highly subjective to the specific application, but consider the simplified example of a payroll-processing class with a processPayroll(long clientId) method. When invoked, the method calculates and stores the paycheck for each of the client's employees. You can probably instrument the method through various means, but an underlying pattern in the execution clearly indicates that the invocation time increases disproportionately with the number of employees. Consequently, examining a trend of elapsed times for processPayroll has no context unless you know how many employees are in each run. More simply put, for a given period of time the average elapsed time of processPayroll was x milliseconds. You can't be sure if that value indicates acceptable or poor performance because if the window comprised only one employee, you would perceive it as poor, but if it comprised 150 employees, you'd think it was flying. Listing 5 displays this simplified concept in code:

Listing 5. A case for contextual tracing
public void processPayroll(long clientId) {
   Collection<Employee> employees = null;
   // Acquire the collection of employees
   employees = popEmployees();
   // Process each employee
   for(Employee emp: employees) {
      processEmployee(emp.getEmployeeId(), clientId);
   }
}

The primary challenge here is that by most instrumentation techniques, anything inside the processPayroll() method is untouchable. So although you might be able to instrument processPayroll and even processEmployee, you have no way of tracing the number of employees to provide context to the method's performance data. Listing 6 displays a poorly hardcoded (and somewhat inefficient) example of how to capture the contextual data in question:

Listing 6. Contextual tracing example
public void processPayrollContextual(long clientId) {      
   Collection<Employee> employees = null;
   // Acquire the collection of employees
   employees = popEmployees();
   // Process each employee
   int empCount = 0;
   String rangeName = null;
   long start = System.currentTimeMillis();
   for(Employee emp: employees) {
      processEmployee(emp.getEmployeeId(), clientId);
      empCount++;
   }
   rangeName = tracer.lookupRange("Payroll Processing", empCount);
   long elapsed = System.currentTimeMillis()-start;
   tracer.trace(elapsed, "Payroll Processing", rangeName, "Elapsed Time (ms)");
   tracer.traceIncident("Payroll Processing", rangeName, "Payrolls Processed");
   log("Processed Client with " + empCount + " employees.");
}

The key part of Listing 6 is the call to tracer.lookupRange. Ranges are named collections that are keyed by a numerical range limit and have a String value representing the name of the numerical range. Instead of tracing a payroll process's simple flat elapsed times, Listing 6 demarcates employee counts into ranges, effectively separating out elapsed times and grouping them by roughly similar employee counts. Figure 10 displays the metric tree generated by the APM system:

Figure 10. Payroll-processing times grouped by range
Payroll processing times grouped by range

Figure 11 illustrates the elapsed times of the payroll processing demarcated by employee counts, revealing the relative relationship between the number of employees and the elapsed time:

Figure 11. Payroll-processing elapsed times by range
Payroll processing elapsed times by range

The tracer configuration properties allow the option of including a URL to a properties file where ranges and thresholds can be defined. (I'll cover thresholds shortly.) The properties are read in at tracer construction time and provide the backing data for the tracer.lookupRange implementation. Listing 7 shows an example configuration of the Payroll Processing range. I have elected to use the XML representation of java.util.Properties because it is more forgiving of oddball characters.

Listing 7. Sample range configuration
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "">
<properties>
   <comment>Payroll Process Range</comment>
   <entry key="L:Payroll Processing">181+ Emps,10:1-10 Emps,50:11-50 Emps,
      80:51-80 Emps,120:81-120 Emps,180:121-180 Emps</entry>
</properties>
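The article does not show how tracer.lookupRange is backed, but a sorted-map ceiling lookup is a natural fit. The `RangeLookup` class below is my own sketch, and its `parse()` method reads one plausible interpretation of the Listing 7 entry format (a leading overflow name followed by limit:name pairs):

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch: range limits map to range names; a ceiling lookup finds the smallest
// limit >= the measured value; values beyond the last limit fall into the overflow range.
public class RangeLookup {
   private final TreeMap<Long, String> ranges = new TreeMap<Long, String>();
   private final String overflowName;

   public RangeLookup(String overflowName) {
      this.overflowName = overflowName;
   }

   public void addRange(long upperLimit, String name) {
      ranges.put(upperLimit, name);
   }

   public String lookupRange(long value) {
      Map.Entry<Long, String> e = ranges.ceilingEntry(value);
      return e == null ? overflowName : e.getValue();
   }

   // Assumed format, per Listing 7: "<overflow name>,<limit>:<name>,<limit>:<name>..."
   public static RangeLookup parse(String definition) {
      String[] parts = definition.split(",");
      RangeLookup r = new RangeLookup(parts[0].trim());
      for(int i = 1; i < parts.length; i++) {
         String[] kv = parts[i].split(":");
         r.addRange(Long.parseLong(kv[0].trim()), kv[1].trim());
      }
      return r;
   }
}
```

A `TreeMap` keeps the lookup at O(log n), which matters because, as noted later, these lookups run on the application thread.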

The injection of externally defined ranges protects your application from constant source-level updates driven by adjusted expectations or business-driven changes to service-level agreements (SLAs). As range and threshold changes take effect, you need only update the external file, not the application itself.

Tracking thresholds and SLAs

The flexibility of externally configurable contextual tracing enables a more accurate and granular way to define and measure performance thresholds. Whereas a range defines a series of numerical windows into which a measurement can be categorized, a threshold is a further categorization that grades the measurement within its determined range. A common requirement in analyzing collected performance data is determining and reporting "successful" executions vs. executions considered "failed" because they did not complete within a specified time. The aggregation of this data can serve as a general report card on a system's operational health and capacity or as a form of SLA compliance assessment.

Using the payroll-processing system example, consider an internal service-level goal that defines execution times of payrolls (within the defined employee count ranges) into bands of Ok, Warn, and Critical. The process of generating threshold counts is conceptually simple. You just need to provide the tracers the values you consider to be the upper elapsed time of each group for each band and direct the tracer to issue a tracer.traceIncident for the categorized elapsed time, and then — to simplify reporting — a total. Table 2 outlines some contrived SLA elapsed times:

Table 2. Payroll-processing thresholds
Employee Count    Ok (ms)       Warn (ms)       Critical (ms)
1-10              up to 280     281 to 400      over 400
11-50             up to 850     851 to 1200     over 1200
51-80             up to 900     901 to 1100     over 1100
81-120            up to 1100    1101 to 1500    over 1500
121-180           up to 1400    1401 to 2000    over 2000
181+              up to 2000    2001 to 3000    over 3000

The ITracer API implements threshold-reporting using values defined in the same XML (properties) file as the ranges we explored. Range and threshold definitions differ slightly in two ways. First, the key value for a threshold definition is a regular expression. When ITracer traces a numeric value, it checks to see if a threshold regular expression matches the compound name of the metric being traced. If it matches, the threshold can then grade the measurement as Ok, Warn, or Critical, and an additional tracer.traceIncident is piggybacked on the trace. Second, because thresholds define only two values (a Critical value is defined as being greater than a warn value), the configuration consists of simply two numbers. Listing 8 shows the threshold configuration for the payroll-process SLA I outlined previously:

Listing 8. The threshold configuration for payroll process
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "">
<properties>
   <!-- Payroll Processing Thresholds -->
   <entry key="Payroll Processing.*81-120 Emps.*Elapsed Time \(ms\)">1100,1500</entry>
   <entry key="Payroll Processing.*1-10 Emps.*Elapsed Time \(ms\)">280,400</entry>
   <entry key="Payroll Processing.*11-50 Emps.*Elapsed Time \(ms\)">850,1200</entry>
   <entry key="Payroll Processing.*51-80 Emps.*Elapsed Time \(ms\)">900,1100</entry>
   <entry key="Payroll Processing.*121-180 Emps.*Elapsed Time \(ms\)">1400,2000</entry>
   <entry key="Payroll Processing.*181\+ Emps.*Elapsed Time \(ms\)">2000,3000</entry>
</properties>
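The match-and-grade step described above could look like the following sketch. ITracer's internals are not shown in the article; the `ThresholdGrader` class, its `Grade` enum, and the first-match policy are my own illustrative choices:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Sketch: each regular expression maps to a warn/critical pair; a traced value
// is graded Ok, Warn, or Critical against the first pattern that matches the
// compound metric name, or NONE if no threshold is defined for that metric.
public class ThresholdGrader {
   public enum Grade { OK, WARN, CRITICAL, NONE }

   private static class Band {
      final long warn, critical;
      Band(long warn, long critical) { this.warn = warn; this.critical = critical; }
   }

   private final Map<Pattern, Band> thresholds = new LinkedHashMap<Pattern, Band>();

   // e.g. addThreshold("Payroll Processing.*1-10 Emps.*Elapsed Time \\(ms\\)", 280, 400)
   public void addThreshold(String regex, long warn, long critical) {
      thresholds.put(Pattern.compile(regex), new Band(warn, critical));
   }

   public Grade grade(String compoundMetricName, long value) {
      for(Map.Entry<Pattern, Band> e : thresholds.entrySet()) {
         if(e.getKey().matcher(compoundMetricName).matches()) {
            Band b = e.getValue();
            if(value > b.critical) return Grade.CRITICAL;
            if(value > b.warn) return Grade.WARN;
            return Grade.OK;
         }
      }
      return Grade.NONE;
   }
}
```

A NONE result is where the piggybacked tracer.traceIncident would simply be skipped.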

Figure 12 shows the metric tree for payroll processing with the added threshold metrics:

Figure 12. Payroll processing metric tree with thresholds
Payroll processing metric tree with thresholds

Figure 13 illustrates what the data collected can represent in the form of a pie chart:

Figure 13. SLA summary for payroll processing (1 to 10 employees)
SLA summary for payroll processing

It is important to ensure that lookups for contextual and threshold categorization are as efficient and as fast as possible, because they execute in the same thread that is doing the actual work. In the ITracer implementation, the first time a metric name is seen it is stored in one of two thread-safe maps: one for metrics with designated thresholds and one for metrics without. After that first trace event, determining a metric's threshold (or the lack of one) costs only a Map lookup, which is typically fast enough. In cases where the number of threshold entries or the number of distinct metric names is extremely high, a reasonable solution would be to defer the threshold determination and have it handled in the asynchronous tracing thread-pool worker.

Conclusion to Part 1

This first article in the series has presented some monitoring antipatterns as well as some desirable attributes of an APM system. I've summarized some general performance data collection patterns and introduced the ITracer interface, which I'll continue to use for the rest of the series. I've demonstrated techniques for monitoring the health of a JVM and general performance data acquisition through JMX. Lastly, I summarized ways you can implement efficient and code-change-resistant source-level instrumentation that monitors raw performance statistics and contextual derived statistics, and how these statistics can be used to report on application SLAs. Part 2 will explore techniques for instrumenting Java systems without modifying the application source code, by using interception, class wrapping, and dynamic bytecode instrumentation.

Go to Part 2 now.
