Java run-time monitoring, Part 3

Monitoring performance and availability of an application's ecosystem

Monitoring hosts, databases, and messaging; management and visualization of performance data


In Part 1 and Part 2 of this three-article series, I presented techniques and patterns for monitoring Java applications, with a focus on the JVM and the application classes. In this final installment, I widen the focus to present techniques for gathering performance and availability data from the application's dependencies, such as the underlying operating system, the network, or the application's backing operational database. I'll conclude with a discussion of gathered-data management patterns, and methods for data reporting and visualization.

Spring-based collectors

In Part 2, I implemented a basic Spring-based component model for managing monitoring services. The rationale for and benefits of this model are:

  • The XML-based configuration eases the management of the larger sets of parameters needed to configure more complex performance data collectors.
  • A separation of concerns structure allows for simpler components that can interact with one another through Spring's dependency injection.
  • Spring gives simple collection beans a life cycle consisting of initialization, start, and stop operations, plus the option of exposing a Java Management Extensions (JMX) management interface so the beans can be controlled, monitored, and troubleshot at run time.

I'll cover more details of the Spring-based collectors in each of this article's sections, as they become applicable.

Monitoring hosts and operating system

Java applications always run on underlying hardware and an operating system that supports the JVM. A critical component in a comprehensive monitoring infrastructure is the capability to gather performance, health, and availability metrics from the hardware and OS — typically through the OS. This section covers some techniques for acquiring this data and tracing it to an application performance management (APM) system through the ITracer class introduced in Part 1.

Typical OS performance metrics

The following summary lists typical metrics that are relevant across a wide cross-section of OSs. Although the details of data collection can vary significantly, and the interpretation of the data must be considered in the context of the specific OS, these metrics are broadly equivalent on most standard hosts:

  • CPU utilization: This represents how busy the CPUs are on a given host. The unit is typically percent utilization, which at a low level indicates the time that a CPU was busy as a percentage of a specific period of elapsed clock time. Hosts can have multiple CPUs, and CPUs can contain multiple cores, but most OSs present each core as a separate CPU. For example, a two-CPU host with dual-core CPUs would be represented as four CPUs. Metrics can usually be gathered per CPU or as a total resource utilization, which represents the aggregate utilization for all processors. Whether to monitor each individual CPU or the aggregate is generally determined by the nature of the software and its internal architecture. A standard multithreaded Java application typically balances the load across all available CPUs by default, so an aggregate is acceptable. However, in some cases, individual OS processes are "pinned" to a specific CPU, and aggregate metrics might not capture an appropriate level of granularity.

    CPU utilization is typically broken down into four categories:

    • System: Processor time spent executing system- or OS-kernel-level activity
    • User: Processor time spent executing user activity
    • I/O Wait: Processor time spent idle waiting for an I/O request to complete
    • Idle: Implicitly, the absence of any processor activity
    Two additional related metrics are the run queue length — essentially the backlog of requests awaiting CPU time — and context switches, which are instances of switching processor time allocation from one process to another.
  • Memory: The simplest memory metrics are the percentage of physical memory available or in use. Additional considerations relate to virtual memory, the rate of memory allocation and deallocation, and more granular metrics regarding which specific areas of memory are being used.
  • Disk and I/O: Disk metrics are the simple (but highly critical) reporting of disk space available or in use per logical or physical disk device as well as the rates of reads and writes against these devices.
  • Network: This is the rate of data transfer and errors on network interfaces, typically broken out into high-level network-protocol categorizations such as TCP and IP.
  • Processes and process groups: All of the preceding metrics can be represented as the total activity for a given host. They can also be broken out into the same metrics but representative of consumption or activity by an individual process or related group of processes. Monitoring resource utilization by process helps to interpret the proportions of resources being consumed by each application or service on a host. Some applications instantiate only one process; in other cases, a service such as an Apache 2 Web Server can instantiate a pool of processes that together represent one logical service.
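
To make the CPU categories concrete, here is a minimal sketch of how percentages can be derived from two samples of cumulative tick counters. It assumes the layout of the aggregate cpu line in Linux /proc/stat (user, nice, system, idle, iowait as the first five counters) and ignores the remaining counters for brevity. The class and method names are hypothetical and are not part of this article's sample code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class CpuUtilization {

    // First five counters of the aggregate "cpu" line in Linux /proc/stat.
    private static final String[] FIELDS = {"user", "nice", "system", "idle", "iowait"};

    /** Parses the counters of a line such as "cpu  1000 0 500 8000 500". */
    static long[] parse(String cpuLine) {
        String[] tokens = cpuLine.trim().split("\\s+");
        long[] ticks = new long[FIELDS.length];
        for (int i = 0; i < FIELDS.length; i++) {
            ticks[i] = Long.parseLong(tokens[i + 1]); // tokens[0] is the "cpu" label
        }
        return ticks;
    }

    /** Returns each category's share of the ticks elapsed between two samples, in percent. */
    static Map<String, Double> percentages(String earlier, String later) {
        long[] a = parse(earlier);
        long[] b = parse(later);
        long total = 0;
        for (int i = 0; i < FIELDS.length; i++) {
            total += b[i] - a[i];
        }
        Map<String, Double> result = new LinkedHashMap<>();
        for (int i = 0; i < FIELDS.length; i++) {
            result.put(FIELDS[i], total == 0 ? 0.0 : 100.0 * (b[i] - a[i]) / total);
        }
        return result;
    }

    public static void main(String[] args) {
        // Two hypothetical samples taken some interval apart.
        String first = "cpu  1000 0 500 8000 500";
        String second = "cpu  1100 0 550 8300 550";
        System.out.println(percentages(first, second));
    }
}
```

Because the counters are cumulative, a single sample says little; the utilization for a period is always the difference between two samples divided by the total ticks that elapsed.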

Agent versus agentless

Different OSs have different mechanisms by which performance data can be accessed. I'll present a number of ways that data can be collected, but a common distinction you are likely to come across in the field of monitoring is the contrast between agent-based and agentless monitoring. The implication is that in some cases data can be collected without a specific installation of additional software on a target host. But clearly an agent of some sort is always involved, inasmuch as monitoring always requires an interface that data must be read through. The real distinction here is between using an agent that is typically always present in a given OS — such as SSH on a Linux® server — and installing additional software that exists for the sole purpose of monitoring and making the collected data available to an external collector. Both approaches involve trade-offs:

  • Agents require additional software installations and may require periodic maintenance patches to be applied. In environments with a large number of hosts, the software-management effort can significantly discourage use of agents.
  • If the agent is physically part of the same process as the application, or even if it is a separate process, a failure of the agent's process will blind the monitoring. Although the host itself may still be running and healthy, the APM must assume it is down because the agent cannot be reached.
  • A local agent installed on a host may have significantly better data-collection and event-listening capabilities than an agentless remote monitor. Furthermore, the reporting of aggregate metrics may require the collection of several raw underlying metrics that would be inefficient if executed remotely. A local agent can efficiently gather data, aggregate it, and make the aggregated data available to the remote monitor.

Ultimately, an optimal solution may be to implement both agentless and agent-based monitoring, with a local agent that's responsible for collecting the bulk of metrics and a remote monitor that checks basics such as the server's up state and the local agent's status.

Agents can also have different options. An autonomous agent collects data on its own schedule, in contrast to a responding agent that delivers data on request. And some agents simply supply data to requesters, whereas others trace data directly or indirectly to the APM system.

Next I'll present techniques for monitoring hosts with Linux and UNIX® OSs.

Monitoring Linux and UNIX hosts

Monitoring agents are available that implement specialized native libraries to collect performance data from Linux and UNIX OSs. But Linux and most UNIX variants have a rich set of built-in data-collection tools that make the data accessible through a virtual file system called /proc. The files appear to be common text files in an ordinary file system directory, but they are actually in-memory data structures that are abstracted through the facade of a text file. Because this data is easily read and parsed by a number of standard command-line utilities or custom tools, the files tend to be simple to use and either very general or very specific in their output. They tend to perform extremely well because they are essentially plucking data directly out of memory.

Common tools used to extract performance data from /proc are ps, sar, iostat, and vmstat (see Related topics for reference documentation on these tools). As a result, an effective way to monitor Linux and UNIX hosts is simply to execute shell commands and parse the responses. Similar monitors can be used across a wide variety of Linux and UNIX implementations; although they all might differ slightly, it is trivial to format the data in a way that makes the collection procedure completely reusable. In contrast, specialized native libraries may need to be recoded or rebuilt for each Linux and UNIX distribution. (It's likely they are reading the same /proc data anyway.) And writing custom shell commands that can do specialized monitoring for a specific case, or that standardize the format of returned data, is simple and incurs low overhead.

Now I'll demonstrate several methods for invoking shell commands and tracing the returned data.

Shell-command execution

To execute data-collection monitoring on a Linux host, you must invoke a shell. It can be bash, csh, ksh, or any other supported shell that allows invocation of the target script or command and retrieves the output. The most common options are:

  • Local shell: If you have a JVM running on the target host, a thread can access the shell by launching a java.lang.Process (for example, through java.lang.ProcessBuilder).
  • Remote Telnet or rsh: Both of these services allow the invocation of a shell and shell commands, but their relatively low security has seen their use diminish. They are disabled by default on most contemporary distributions.
  • Secure Shell (SSH): SSH is the most commonly used remote shell. It offers full access to the Linux shell and is generally considered secure. This is the primary mechanism I'll use in the article's shell-based examples. SSH services are available on a wide variety of OSs, including virtually all flavors of UNIX, Microsoft® Windows®, OS/400, and z/OS.
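
The local-shell option can be sketched with the standard java.lang.ProcessBuilder API. This is an illustration only: the LocalShell class and its issueCommand() method are hypothetical names, and the sketch assumes a POSIX host where /bin/sh exists.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class LocalShell {

    /** Runs a command through /bin/sh and returns everything it wrote to stdout and stderr. */
    static String issueCommand(String command) {
        ProcessBuilder builder = new ProcessBuilder("/bin/sh", "-c", command);
        builder.redirectErrorStream(true); // merge stderr into stdout so one reader drains both
        StringBuilder output = new StringBuilder();
        try {
            Process process = builder.start();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    output.append(line).append('\n');
                }
            }
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IllegalStateException(
                        "Command exited with status " + exitCode + ": " + output);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            throw new IllegalStateException("Interrupted while waiting for: " + command, e);
        }
        return output.toString();
    }

    public static void main(String[] args) {
        System.out.print(issueCommand("uname -s"));
    }
}
```

Reading the output to the end before calling waitFor() matters: a child process that fills its output pipe blocks until something reads from it.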

Figure 1 shows the conceptual difference between a local shell and a remote shell:

Figure 1. Local and remote shells

A small amount of setup is required to initiate an unattended SSH session with a server. You must create an SSH keypair consisting of a private key and a public key. The contents of the public key are placed on the target server, and the private key is placed on the remote monitoring server where the data collector can access it. Once this is done, the data collector can supply the private key and the private key's passphrase and access a secure remote shell on the target server. The target account's password is not required and is superfluous when you use a keypair. The setup steps are:

  1. Make sure that the target host has an entry in your local known-hosts file. This is a file that lists known IP addresses or names and the associated SSH public key recognized for each. At a user level, this file is typically the ~/.ssh/known_hosts file in the user's home directory.
  2. Connect to the target server using the monitoring account (for example, monitoruser).
  3. Create a subdirectory called .ssh in the home directory.
  4. Change directory into the .ssh directory and issue the ssh-keygen -t dsa command. The command prompts you for a key name and a passphrase. Two files are then generated: monitoruser_dsa (the private key) and a corresponding public-key file.
  5. Copy the private key to a secure location accessible from where the data collectors will run.
  6. Append the public key contents to a file in the .ssh directory called authorized_keys, for example by redirecting the output of cat on the public-key file with >> authorized_keys.

Listing 1 shows the process I've just outlined:

Listing 1. Creating an SSH key pair
whitehen@whitehen-desktop:~$ mkdir .ssh
whitehen@whitehen-desktop:~$ cd .ssh
whitehen@whitehen-desktop:~/.ssh$ ssh-keygen -t dsa
Generating public/private dsa key pair.
Enter file in which to save the key (/home/whitehen/.ssh/id_dsa): whitehen_dsa
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in whitehen_dsa.
Your public key has been saved in
The key fingerprint is:
46:cd:d4:e4:b1:28:d0:41:f3:ea:3b:8a:74:cb:57:e5 whitehen@whitehen-desktop
whitehen@whitehen-desktop:~/.ssh$ cat >> authorized_keys

The data collector can now make an SSH connection to the target Linux host, called whitehen-desktop, which is running Ubuntu Linux.

The data collection for this example will be accomplished using a generic shell collector class. An instance of this class will be deployed in a Spring context under the name UbuntuDesktopRemoteShellCollector. However, some remaining dependencies are required to complete the whole process:

  • A scheduler is required to invoke the collector on a fixed schedule. This is accomplished by an instance of java.util.concurrent.ScheduledThreadPoolExecutor, which provides both a scheduled callback mechanism and a thread pool. It will be deployed in Spring under the name CollectionScheduler.
  • An SSH shell implementation is required to invoke commands against the server and return the results. This is provided by an SSH shell class that implements a common shell interface. An instance will be deployed in Spring under the name UbuntuDesktopRemoteShell.
  • Rather than hard-coding a set of commands and their associated parsing routines, the collector uses a command-set instance that will be deployed in Spring under the name UbuntuDesktopCommandSet. The command set is loaded from an XML document that describes:
    • The target platform the shell will be executed against
    • The commands that will be executed
    • How the data returned will be parsed and mapped to APM tracing namespaces
    I'll provide more detail on these definitions shortly. Figure 2 outlines the basic relationship among the collector, the shell, and the command set:
Figure 2. The collector, shell, and command set
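
The scheduler's role can be sketched with java.util.concurrent.ScheduledThreadPoolExecutor alone. This is a simplified illustration, not the article's collector: the class name, the runCollections() helper, and the pool size of 2 are assumptions made for the example.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class CollectionSchedulerSketch {

    /**
     * Schedules the collector at a fixed period and blocks until it has run the
     * requested number of times or a generous timeout passes.
     * Returns true if all runs completed.
     */
    static boolean runCollections(Runnable collector, int runs, long periodMillis) {
        ScheduledThreadPoolExecutor scheduler = new ScheduledThreadPoolExecutor(2);
        CountDownLatch remaining = new CountDownLatch(runs);
        try {
            scheduler.scheduleAtFixedRate(() -> {
                try {
                    collector.run();
                } catch (RuntimeException e) {
                    // An exception escaping the task would suppress all of its future runs.
                    System.err.println("Collection failed: " + e);
                } finally {
                    remaining.countDown();
                }
            }, 0, periodMillis, TimeUnit.MILLISECONDS);
            return remaining.await(periodMillis * runs + 5000, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        boolean completed = runCollections(
                () -> System.out.println("collecting at " + System.currentTimeMillis()), 3, 100);
        System.out.println("completed: " + completed);
    }
}
```

Catching exceptions inside the scheduled task is the important design point: scheduleAtFixedRate() stops rescheduling a task whose execution throws, so an unguarded collector can fail once and then silently stop reporting.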

Now I'll drill down into some brief examples of specific performance-data-producing commands and how to configure them. A classic example is the sar command. The definition of sar from the Linux man page (see Related topics) is "Collect, report, or save system activity information." The command is quite flexible, with more than 20 arguments that can be used in combination. A simple option is to call sar -u 1 3, which reports CPU utilization measured over three intervals of one second each. Listing 2 shows the output:

Listing 2. Example sar command output
whitehen@whitehen-desktop:~$ sar -u 1 3
Linux 2.6.22-14-generic (whitehen-desktop)      06/02/2008

06:53:24 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
06:53:25 PM     all      0.00      0.00      0.00      0.00      0.00    100.00
06:53:26 PM     all      0.00     35.71      0.00      0.00      0.00     64.29
06:53:27 PM     all      0.00     20.79      0.99      0.00      0.00     78.22
Average:        all      0.00     18.73      0.33      0.00      0.00     80.94

The output can be broken down into a preamble, a header, the three interval data readings, and a summary average of the readings. The goal here is to execute this shell command, capture the output, parse it, and trace it to the APM system. The format is simple enough, but output formats can vary (slightly or greatly) by version, and other sar options return completely different data (not to mention that other commands return different formats). Listing 3, for example, shows sar execution displaying socket activity:

Listing 3. sar displaying socket activity
whitehen@whitehen-desktop:~$ sar -n SOCK 1
Linux 2.6.22-14-generic (whitehen-desktop)      06/02/2008

06:55:10 PM    totsck    tcpsck    udpsck    rawsck   ip-frag
06:55:11 PM       453         7         9         0         0
Average:          453         7         9         0         0

Consequently, what's needed is a solution that lets you configure different formats rapidly without needing to recode the collectors. It is also helpful to be able to translate cryptic words such as totsck to more readable phrases like Total Used Sockets before the collected traces hit the APM system.
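
As a rough sketch of what such parsing and renaming involves, the following pairs the header columns of sar output with the values in the Average: row and applies a name mapping. The class name and the right-alignment rule are assumptions for illustration; the article's CommandSet expresses the same idea declaratively.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

class SarAverageParser {

    /**
     * Pairs each header column with the value in the "Average:" row and
     * renames columns that appear in nameMapping.
     */
    static Map<String, Integer> parseAverages(String sarOutput, Map<String, String> nameMapping) {
        String[] header = null;
        boolean pastPreamble = false;
        Map<String, Integer> metrics = new LinkedHashMap<>();
        for (String line : sarOutput.split("\\r?\\n")) {
            if (line.trim().isEmpty()) {
                pastPreamble = true; // the blank line separates the preamble from the report
                continue;
            }
            if (!pastPreamble) {
                continue;
            }
            String[] tokens = line.trim().split("\\s+");
            if (header == null) {
                header = tokens; // the first line of the report names the columns
            } else if (tokens[0].equals("Average:")) {
                // The header starts with a timestamp (and possibly AM/PM), so the two
                // rows differ in length; align them on their right-hand edge.
                int shift = header.length - tokens.length;
                for (int i = 1; i < tokens.length; i++) {
                    String rawName = header[i + shift];
                    metrics.put(nameMapping.getOrDefault(rawName, rawName),
                            Integer.parseInt(tokens[i]));
                }
            }
        }
        return metrics;
    }

    public static void main(String[] args) {
        String output = "Linux 2.6.22-14-generic (whitehen-desktop)      06/02/2008\n\n"
                + "06:55:10 PM    totsck    tcpsck    udpsck    rawsck   ip-frag\n"
                + "06:55:11 PM       453         7         9         0         0\n"
                + "Average:          453         7         9         0         0\n";
        Map<String, String> names = new HashMap<>();
        names.put("totsck", "Total Sockets");
        names.put("tcpsck", "TCP Sockets");
        System.out.println(parseAverages(output, names));
    }
}
```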

In some cases, you may have the option of acquiring this data in XML format. For example, the sadf command in the SysStat package (see Related topics) generates much of the typically collected Linux monitoring data in XML. The XML format lends much more predictability and structure to the data and virtually eliminates the tasks of parsing, mapping data to tracing namespaces, and decryptifying obscure words. However, these tools are not always available for shell-accessible systems you may want to monitor, so a flexible text parsing and mapping solution is invaluable.

Following up on the two preceding examples of sar usage, I'll now present an example of setting up the salient Spring bean definitions to monitor this data. All the examples referenced are included in this article's sample code (see Download).

First, the main entry point for the SpringCollector implementation is org.runtimemonitoring.spring.collectors.SpringCollector. It takes one argument: the directory name where the Spring bean configuration files live. SpringCollector loads any files with a .xml extension and treats them as bean descriptors. This directory is the ./spring-collectors directory in the root of the project. (Later in the article, I will outline all the files in this directory. Multiple files are optional, and all the definitions could be bundled up into one, but keeping them separated by notional functionality helps keep things a bit better organized.) The three bean definitions in this example represent the shell collector, the shell, and the command set. The descriptors are shown in Listing 4:

Listing 4. Bean descriptors for a shell collector, shell, and command set
<!-- The Collector -->
<bean id="UbuntuDesktopRemoteShellCollector"
  class="...">
  <property name="shell" ref="UbuntuDesktopRemoteShell"/>
  <property name="commandSet" ref="UbuntuDesktopCommandSet"/>
  <property name="scheduler" ref="CollectionScheduler"/>
  <property name="tracingNameSpace" value="Hosts,Linux,whitehen-desktop"/>
  <property name="frequency" value="5000"/>
</bean>

<!-- The Shell -->
<bean id="UbuntuDesktopRemoteShell"
   class="...">
   <property name="userName" value="whitehen"/>
   <property name="hostName" value="whitehen-desktop"/>
   <property name="port" value="22"/>
   <property name="knownHostsFile"
      value="C:/Documents and Settings/whitehen/.ssh/known_hosts"/>
   <property name="privateKey"
      value="..."/>
   <property name="passphrase" value="Hello World"/>
</bean>

<!-- The CommandSet -->
<bean id="UbuntuDesktopCommandSet"
   class="...">
   <constructor-arg type="..."
      value="..."/>
</bean>

The CommandSet bean in Listing 4 has only an id (UbuntuDesktopCommandSet) and a URL to another XML file. This is because the command sets are quite large, and I did not want to clutter the Spring files with them. I will describe the CommandSet shortly.

The first bean in Listing 4 is UbuntuDesktopRemoteShellCollector. The bean id value is arbitrary and descriptive, although it must be used consistently when this bean is referenced from another bean. The class in this case is a generic class for collecting data through a shell-like interface. The other salient properties are:

  • shell: The instance of the shell class that the collector will use to invoke and retrieve data from shell commands. Spring injects the instance of the shell with the bean id of UbuntuDesktopRemoteShell.
  • commandSet: The CommandSet instance that represents a set of commands and associated parsing and tracing namespace mapping directives. Spring injects the instance of the command set with the bean id of UbuntuDesktopCommandSet.
  • scheduler: A reference to a scheduling thread pool that manages the scheduling of the collections of data and allocates a thread to do the job.
  • tracingNameSpace: The tracing namespace prefix that controls where in the APM tree these metrics will be traced to.
  • frequency: The frequency of the data collections, in milliseconds (5000, or every five seconds, in Listing 4).

The second bean in Listing 4 is the shell, which is an implementation of an SSH shell. The class is implemented using JSch (see Related topics). Its other salient properties are:

  • userName: The name of the user to connect to the Linux server as.
  • hostName: The name (or IP address) of the Linux server to connect to.
  • port: The Linux server port on which sshd is listening.
  • knownHostsFile: A file containing the host names and SSH public keys of SSH servers that are "known" to the local host where the SSH client is running. (This security mechanism in SSH is an interesting reversal of traditional security hierarchies whereby the client may not trust the host and will not connect unless the host is "known" and presents a matching key.)
  • privateKey: The SSH private-key file that is used to authenticate to the SSH server.
  • passphrase: The passphrase that is used to unlock the private key. It resembles a password, except that it is not transmitted to the server and is only used to decrypt the private key locally.

Listing 5 shows CommandSet's internals:

Listing 5. CommandSet internals
<CommandSet name="UbuntuDesktop">
   <shellCommand>sar -u 1</shellCommand>
   <paragraph id="1" name="CPU Utilization"/>
   <columns entryName="1" values="2-7" offset="1"/>
   <tracers default="SINT"/>
   <shellCommand>sar -n SOCK 1</shellCommand>
   <paragraph id="1" name="Socket Activity"/>
   <columns values="1-5" offset="1">
      <namemapping from="ip-frag" to="IP Fragments"/>
      <namemapping from="rawsck" to="Raw Sockets"/>
      <namemapping from="tcpsck" to="TCP Sockets"/>
      <namemapping from="totsck" to="Total Sockets"/>
      <namemapping from="udpsck" to="UDP Sockets"/>
   </columns>
   <tracers default="SINT"/>
</CommandSet>

The CommandSet is responsible for managing shell commands and parsing directives. Because each Linux or UNIX system will have slightly different output, even for the same commands, there would typically be one CommandSet for each unique host type being monitored. A detailed description of every single option in the XML behind a CommandSet would take too long, because it is constantly evolving and being tweaked for new situations, but the following is a brief overview of some of the tags:

  • <shellCommand>: Defines the actual command that will be passed to the shell.
  • <paragraphSplitter>: Some commands, or commands comprising multiple chained commands, may return multiple sections of text as a result. These are referred to as paragraphs. The regular expression (regex) defined here specifies how the paragraphs are demarcated. The command object splits the result into multiple paragraphs and passes the requested paragraph to the underlying extractors.
  • <Extractors> and the <CommandResultExtract> tags that they contain: These constructs define the parsing and mapping.
  • <paragraph>: The extractor defines which paragraph it wants from the result using the zero-based index in the id attribute, and all traced metrics from the paragraph fall under the tracing namespace defined in the paragraph name.
  • <columns>: If an entryName is defined, then the indexed column in each row is added to the tracing namespace. This is for cases where a left-side column contains a metric demarcation. For example, one option for sar will report the CPU utilization for each individual CPU, and the CPU number is listed in the second column. In Listing 5, the entryName extracts the all qualifier, indicating that the report is an aggregate summary for all CPUs. The values attribute represents which columns in each row should be traced, and offset accounts for any imbalance between the number of columns in a data row and the corresponding header.
  • <tracers>: Defines the default tracing type and allows different tracer types to be defined for values associated with a named header or entryName.
  • <filterLine>: If defined, the regex ignores data rows where the full line of text does not match.
  • <lineSplit>: Defines the split regex used to parse the lines in each paragraph.
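
The paragraph, line-split, and line-filter steps can be illustrated with plain java.util.regex calls. This is a hedged sketch of the general technique rather than the CommandSet implementation; the class name, method names, and the regexes passed in main() are assumptions chosen to fit the sar output shown in Listing 2.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class ParagraphExtractor {

    /** Splits raw command output into paragraphs and returns the one at the zero-based index. */
    static String paragraph(String output, String paragraphSplitterRegex, int index) {
        String[] paragraphs = output.split(paragraphSplitterRegex);
        return paragraphs[index];
    }

    /** Splits a paragraph into lines and keeps only the lines that fully match the filter. */
    static List<String> matchingLines(String paragraph, String lineSplitRegex,
                                      String filterLineRegex) {
        Pattern filter = Pattern.compile(filterLineRegex);
        List<String> kept = new ArrayList<>();
        for (String line : paragraph.split(lineSplitRegex)) {
            if (filter.matcher(line).matches()) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String output = "Linux 2.6.22-14-generic (whitehen-desktop)      06/02/2008\n\n"
                + "06:53:24 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle\n"
                + "06:53:25 PM     all      0.00      0.00      0.00      0.00      0.00    100.00\n"
                + "Average:        all      0.00     18.73      0.33      0.00      0.00     80.94\n";
        // Paragraph 0 is the preamble; paragraph 1 holds the header and data rows.
        String report = paragraph(output, "\\n\\s*\\n", 1);
        System.out.println(matchingLines(report, "\\n", "\\s*Average:.*"));
    }
}
```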

Figure 3 shows the APM tree for this example:

Figure 3. APM tree for Ubuntu Desktop monitoring

If you don't like the look of this tree, you have other options. The command that is sent to the server can easily be modified to pipe through a series of grep, awk, and sed commands to reformat the data to a form that requires far less parsing. For example, see Listing 6:

Listing 6. Formatting command output within the command
whitehen@whitehen-desktop:~$ sar -u 1 | grep Average | \
   awk '{print "SINT/User:"$3"/System:"$5"/IOWait:"$6}'
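
Output reshaped this way needs very little parsing on the collector side. The sketch below assumes, for illustration only, that the first segment names the tracer type and that each remaining segment is a name:value pair; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FormattedResultParser {

    /** Returns the leading segment, for example "SINT". */
    static String tracerType(String line) {
        return line.trim().split("/", 2)[0];
    }

    /** Parses the name:value segments that follow the leading segment. */
    static Map<String, Double> metrics(String line) {
        String[] segments = line.trim().split("/");
        Map<String, Double> result = new LinkedHashMap<>();
        for (int i = 1; i < segments.length; i++) { // segments[0] is the tracer type
            String[] pair = segments[i].split(":", 2);
            result.put(pair[0], Double.parseDouble(pair[1]));
        }
        return result;
    }

    public static void main(String[] args) {
        // The line that the awk expression would print for the Average row in Listing 2.
        String line = "SINT/User:0.00/System:0.33/IOWait:0.00";
        System.out.println(tracerType(line) + " " + metrics(line));
    }
}
```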

Another option that provides an optimal combination of configuration, flexibility, and performance is the use of dynamic scripting, especially in cases where additional formatting tools may not be available and the output format is particularly awkward. In the next example, I have configured a Telnet shell to collect load-balancing status data from a Cisco CSS load balancer. The output format and content is particularly problematic for any sort of standardized parsing, and the shell supports a limited set of commands. Listing 7 shows the command's output:

Listing 7. Output from a CSS Telnet command
Service Name                     State     Conn  Weight  Avg   State
                                                         Load  Transitions

ecommerce1_ssl                   Alive         0      1   255            0
ecommerce2_ssl                   Down          0      1   255            0
admin1_ssl                       Alive         0      1     2         2982
admin2_ssl                       Down          0      1   255            0
clientweb_ssl                    Alive         0      1   255            0

Listing 8 shows the command set used to execute the command and parse its output. Note the <preFormatter beanName="FormatCSSServiceResult"/> tag. This is a reference to a Spring bean that contains a few lines of Groovy script. The raw output of the Telnet shell command is passed to the Groovy script, and the return value is then passed to the command data extractor in a much friendlier format. Also note that the tracer type is overridden to a STRING type for the value in the column labeled Status. Keen-eyed observers will note that there is no column with that name, but part of the Groovy script's job is to fix the fact that there appear to be two columns called State (and you can see why), so the script renames the first one to Status.

Listing 8. The CSS CommandSet
<CommandSet name="CiscoCSS">
   <shellCommand>show service summary</shellCommand>
   <preFormatter beanName="FormatCSSServiceResult"/>
   <paragraph id="0" name="Service Summary" header="true"/>
   <columns entryName="0" values="1-5" offset="0"/>
   <tracers default="SINT">
      <tracer type="STRING">Status</tracer>
   </tracers>
</CommandSet>

The Groovy bean has a number of benefits. The script is dynamically configurable, so it can be changed at run time. The bean invokes the Groovy compiler on the next call only when it detects that the source has changed, so steady-state performance is adequate. The language is also rich in parsing functionality and simple to write. Listing 9 shows the Groovy bean that contains the text of the source code inline:

Listing 9. A Groovy formatting bean
<bean id="FormatCSSServiceResult"
   class="..."
   init-method="init" lazy-init="false">
   <property name="sourceCode"><value><![CDATA[
      String[] lines = formatTarget.toString().split("\r\r\n");
      StringBuffer buff = new StringBuffer();
      lines.each() {
         // Drop the second header line, the echoed command, and blank lines
         if(!(
            it.contains("Load  Transitions") ||
            it.contains("show service summary") ||
            it.trim().length() < 1)) {
            buff.append(it).append("\n");
         }
      }
      return buff.toString()
      .replaceFirst("State", "Status")
      .replaceFirst("Service Name", "ServiceName")
      .replace("State", "Transitions");
   ]]></value></property>
</bean>
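
For readers who want to try the transformation outside Spring and Groovy, the same clean-up can be sketched in plain Java. This is an illustrative equivalent under stated assumptions, not the article's bean: the class name is hypothetical, and the line separator is relaxed to any run of carriage returns followed by a newline.

```java
class CssResultFormatter {

    /** Removes noise lines and renames the ambiguous header columns. */
    static String format(String rawOutput) {
        StringBuilder kept = new StringBuilder();
        for (String line : rawOutput.split("\\r*\\n")) {
            boolean noise = line.contains("Load  Transitions")   // second header line
                    || line.contains("show service summary")     // echoed command
                    || line.trim().isEmpty();                     // blank line
            if (!noise) {
                kept.append(line).append('\n');
            }
        }
        return kept.toString()
                .replaceFirst("State", "Status")             // first State column
                .replaceFirst("Service Name", "ServiceName") // make the name one token
                .replace("State", "Transitions");            // remaining State column
    }

    public static void main(String[] args) {
        String raw =
            "Service Name                     State     Conn  Weight  Avg   State\r\n"
          + "                                                         Load  Transitions\r\n"
          + "\r\n"
          + "ecommerce1_ssl                   Alive         0      1   255            0\r\n"
          + "ecommerce2_ssl                   Down          0      1   255            0\r\n";
        System.out.print(format(raw));
    }
}
```

After formatting, the header occupies a single line whose column names are unique and free of embedded spaces, which is what lets the generic column extractor in Listing 8 handle the result.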

Figure 4 shows the APM metric tree for the CSS monitoring:

Figure 4. APM tree for CSS monitoring

SSH connections

One last consideration for Linux/UNIX shell collecting is the issue of SSH connections. All the shell classes share a basic interface that defines two variations of a method called issueOSCommand(), in which the command is passed as a parameter and the result is returned. In my example using the remote SSH class, the underlying shell invocation is based on the implementation of the SSHEXEC task in Apache Ant (see Related topics). The advantage of the technique used is that it is simple, but it has a decided downside: A new connection is made for every command that is issued. This is obviously inefficient. A remote shell might only be polled every few minutes, but each polling cycle can execute several commands to acquire the appropriate range of monitoring data. The challenge is that maintaining an open session for the duration of a monitoring window (across multiple polling cycles) is tricky. It requires much more detailed inspection and parsing of the returned data to account for different shell types and the constant appearance of the shell prompt, which of course is not part of your expected return value.

I have been working on a long-lived session shell implementation. The alternative is to compromise: maintain a one-connection-per-polling-cycle pattern, but try to capture all the data in one command. This can be done by appending commands or, in some cases, using multiple options against one command. For example, the version of sar on my SuSE Linux servers has a -A option that returns a sampling of all the supported metrics of sar; this command is the equivalent of sar -bBcdqrRuvwWy -I SUM -n FULL -P ALL. The returned data will have multiple paragraphs, but you should have no issue parsing it with a command set. For an example of this, see the command-set definition in this article's sample code named Suse9LinuxEnterpriseServer.xml (see Download).
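
One way to append commands so that a single connection serves a whole polling cycle is to echo a marker line after each command and split the combined output on that marker. The sketch below is an assumption-laden illustration: the marker text, the class, and the Function standing in for the shell are all hypothetical, and the marker must not occur in real command output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Pattern;

class BatchedCommands {

    static final String MARKER = "---END-OF-COMMAND---";

    /** Joins the commands so that a marker line is echoed after each command's output. */
    static String combine(List<String> commands) {
        List<String> parts = new ArrayList<>();
        for (String command : commands) {
            parts.add(command + "; echo " + MARKER);
        }
        return String.join("; ", parts);
    }

    /** Splits the combined output back into one result per command. */
    static List<String> split(String combinedOutput) {
        String[] parts = combinedOutput.split(Pattern.quote(MARKER) + "\\r?\\n?", -1);
        // Whatever follows the final marker is not a command result, so drop it.
        return new ArrayList<>(Arrays.asList(parts).subList(0, parts.length - 1));
    }

    /** Sends all commands through the shell in one call and returns the separate results. */
    static List<String> issueAll(Function<String, String> shell, List<String> commands) {
        return split(shell.apply(combine(commands)));
    }

    public static void main(String[] args) {
        // A stand-in shell that returns canned output instead of connecting anywhere.
        Function<String, String> fakeShell =
                command -> "cpu data\n" + MARKER + "\nsocket data\n" + MARKER + "\n";
        System.out.println(combine(Arrays.asList("sar -u 1", "sar -n SOCK 1")));
        System.out.println(issueAll(fakeShell, Arrays.asList("sar -u 1", "sar -n SOCK 1")));
    }
}
```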

Monitoring Windows

Performance-data collection is no exception to the substantial difference between Microsoft Windows and Linux/UNIX. Windows has virtually no native command-line tools that provide a comparable wealth of performance-reporting data. Nor is performance data accessible through anything like the relatively simple /proc filesystem. The Windows Performance Manager (WPM) — also referred to as SysMon, System Monitor, or Performance Monitor — is the standard interface for acquiring performance measurements from Windows hosts. It is quite powerful and rich in useful metrics. Furthermore, many Windows-based software packages publish their own metrics through WPM. Windows also provides charting, reporting, and alerting facilities through WPM. Figure 5 shows a screenshot of a WPM instance:

Figure 5. Windows Performance Manager

WPM manages a set of performance counters, which are compound-named objects that reference a specific metric. The components of the compound name are:

  • Performance object: The general category of performance metric, such as Processor or Memory.
  • Instance: Some performance objects are demarcated by an instance when there are multiple possible members. For example, Processor has instances that represent each individual CPU and a summary total instance. In contrast, Memory is a "flat" performance object, because memory has only one manifestation.
  • Counter: The granular name of the metric within the instance (if applicable) and performance object. For example, the Processor instance 0 has a counter called % Idle Time.

Based on these name segments, the naming convention and syntax for expressing these objects is:

  • With instance: \performance object(instance name)\counter name
  • Without instance: \performance object\counter name
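To make the naming convention concrete, here is a minimal Java sketch that formats and parses counter names in this syntax. The CounterPath class is hypothetical and is not part of the article's sample code; it also assumes that instance names do not themselves contain backslashes.

```java
/** Hypothetical value class for a WPM counter name; not part of the sample code. */
final class CounterPath {
    final String performanceObject;
    final String instance; // null for "flat" objects such as Memory
    final String counter;

    CounterPath(String performanceObject, String instance, String counter) {
        this.performanceObject = performanceObject;
        this.instance = instance;
        this.counter = counter;
    }

    /** Formats the name as \object(instance)\counter or \object\counter. */
    String format() {
        String instancePart = (instance == null) ? "" : "(" + instance + ")";
        return "\\" + performanceObject + instancePart + "\\" + counter;
    }

    /** Parses either form of the syntax back into its components. */
    static CounterPath parse(String path) {
        int separator = path.indexOf('\\', 1);
        if (!path.startsWith("\\") || separator < 0) {
            throw new IllegalArgumentException("Not a counter path: " + path);
        }
        String head = path.substring(1, separator);
        String counter = path.substring(separator + 1);
        int open = head.indexOf('(');
        if (open >= 0 && head.endsWith(")")) {
            String instance = head.substring(open + 1, head.length() - 1);
            return new CounterPath(head.substring(0, open), instance, counter);
        }
        return new CounterPath(head, null, counter);
    }
}
```

For example, new CounterPath("Processor", "0", "% Idle Time").format() yields \Processor(0)\% Idle Time.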

WPM's significant downside is that it can be challenging to access this data, especially remotely, and critically challenging from non-Windows platforms. I will present a number of techniques for capturing WPM data using ITracer-based collectors. Here's a brief summary of the major options:

  • Log-file reading: WPM can be configured to log all collected metrics to a log file, which can then be read, parsed, and traced.
  • Database query: WPM can be configured to log all collected metrics to an SQL Server database, where they can be read, parsed, and traced.
  • Win32 API: Clients written using Win32 APIs (.NET, C++, Visual Basic, and so on) can connect directly to WPM using WPM's API.
  • Custom agent: A custom agent can be installed on the target Windows server that can act as a proxy for external requests for WPM data from non-Windows clients.
  • Simple Network Management Protocol (SNMP): SNMP is one kind of agent, but it deserves separate emphasis because it is almost ubiquitous as a means of monitoring devices, hosts, and so on. I'll discuss SNMP later in this article.
  • WinRM: WinRM is the Windows implementation of the WS-Management specification, which outlines the use of Web services for system management. Because Web services are language and platform-neutral, this certainly provides non-Windows clients access to WPM metrics. Although this can be considered another form of an agent, it will become standard in Windows 2008, casting it into the arena of agentless solutions. More interestingly, Java Specification Request 262 (Web Services Connector for JMX Agent) promises to interact directly with Windows-based, WS-Management services.

In the examples that follow, I'll present a theoretical proof of concept using a local Windows shell and an agent implementation.

Local Windows shell

As a simple proof of concept, I've created a Windows command-line executable in C# called winsar.exe. Its intent is to provide some of the same command-line access to performance statistics as the Linux/UNIX sar command. The syntax for the command line usage is simple: winsar.exe Category Counter Raw Instance.

The instance name is mandatory unless the counter is not an instance counter and can be all (*). The counter name is mandatory but can be all (*). Raw is true or false. Listing 10 displays example uses for an instance-based counter and a non-instance-based counter:

Listing 10. winsar examples with non-instance- and instance-based counters
C:\NetProjects\WinSar\bin\Debug>winsar Memory "% Committed Bytes In Use" false
C:\NetProjects\WinSar\bin\Debug>winsar LogicalDisk "Current Disk Queue Length" false C:
C:  2

Based on my intention to recreate something like sar, the data output is in a rough (nonformatted) tabular form so it can be parsed using a standard shell command set. For instance-based counters, the instances are in the first column of the data lines, with the counter names across the header line. For non-instance-based counters, there are no names in the first field of the data lines. For parsing clarity, any names with spaces are filled with "-" characters. The result is fairly ugly but easily parsed.

Setting up a collector for these statistics (which are abbreviated for presentation) is fairly straightforward. The shell implementation is a local Windows shell, and the command sets reference winsar.exe and its arguments. The shell can also be implemented as a remote shell using SSH, which requires the installation of an SSH server on the target Windows host. However, this solution is highly inefficient, primarily because the implementation is .NET-based: starting up a Common Language Runtime (CLR) for such a brief period on a repeating basis is wasteful.

Another solution might be to rewrite winsar in native C++. I'll leave that to Windows programming experts. The .NET solution can be made efficient, but the program must remain running as a background process, servicing requests for WPM data through some other means and not terminating after every request. In pursuit of this, I implemented a second option in winsar in which an argument of -service starts the program, reads in a configuration file called winsar.exe.config, and listens for requests over a Java Message Service (JMS) topic. The contents of the file are fairly self-explanatory except for a couple of items. The jmsAssembly item refers to the name of a .NET assembly containing a .NET version of the JBoss 4.2.2 client libraries that are supplying the JMS functionality. This assembly was created using IKVM (see Related topics). The respondTopic references the name of the public topic where responses are published, rather than using a private topic, so that other listeners can receive the data as well. The commandSet is a reference to the command set that should be used by the generic receiver to parse and trace the data. Listing 11 shows the winsar.exe.config file:

Listing 11. The winsar.exe.config file
      <add key="java.naming.factory.initial"
      <add key="java.naming.factory.url.pkgs"
      <add key="java.naming.provider.url" value=""/>
      <add key="connectionFactory" value="ConnectionFactory"/>
      <add key="listenTopic" value="topic/StatRequest"/>
      <add key="respondTopic" value="topic/StatResponse"/>
      <add key="jmsAssembly" value="JBossClient422g" />
      <add key="commandSet" value="WindowsServiceCommandSet" />

Implementing the collector in Spring to use this service is conceptually similar to setting up the shells. In fact, the collector itself is an extension of an existing collector class. The difference is that this collector acts like an ordinary collector and issues requests for data, but the data is received through JMS, parsed, and traced by another component. The shell implementation it uses behaves like a shell but dispatches the command through JMS, as illustrated in Figure 6:

Figure 6. Delegating collector

Because this looks like a good strategy for deploying agents across the board, the same JMS-based agent is implemented in Java code and can be deployed on any JVM-supporting OS. A JMS publish/subscribe performance data collection system is illustrated in Figure 7:

Figure 7. JMS publish/subscribe monitoring

A further distinction can be drawn with respect to how the JMS agents function. The pattern illustrated in this example uses a request-listening agent on the target hosts: after they are started, the agents perform no activity until they receive a request from the central monitoring system. Alternatively, these agents could act autonomously by collecting data and publishing it to the same JMS server on their own schedule. The advantage of the listening agents is twofold. First, the collection parameters can be configured and maintained in one central location rather than reaching out to each target host. Second (although not implemented in this example), because the central requesting monitor is sending out requests, the monitor can trigger an alert condition if a specific known server does not respond. Figure 8 displays the APM tree for the combined servers:
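The second advantage, which the sample code does not implement, could be approached along the following lines. This is a minimal sketch under my own assumptions: the ResponseWatchdog class and its method names are hypothetical, and the caller is expected to supply timestamps and to decide how an overdue host becomes an alert.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical tracker for requests the central monitor has sent but not seen answered. */
final class ResponseWatchdog {
    private final long timeoutMillis;
    private final Map<String, Long> pendingSince = new HashMap<>();

    ResponseWatchdog(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Records that a request was published for the named host. */
    synchronized void requestSent(String host, long nowMillis) {
        // Keep the oldest unanswered request so repeated polls cannot mask an outage.
        pendingSince.putIfAbsent(host, nowMillis);
    }

    /** Clears the pending state when a response from the host arrives. */
    synchronized void responseReceived(String host) {
        pendingSince.remove(host);
    }

    /** Returns the hosts whose oldest unanswered request has exceeded the timeout. */
    synchronized List<String> overdueHosts(long nowMillis) {
        List<String> overdue = new ArrayList<>();
        for (Map.Entry<String, Long> entry : pendingSince.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                overdue.add(entry.getKey());
            }
        }
        Collections.sort(overdue);
        return overdue;
    }
}
```

The monitor would call requestSent() when it publishes to the request topic, responseReceived() from its response listener, and overdueHosts() on each polling cycle.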

Figure 8. APM tree for Windows and Linux servers

winsar is a simple and early prototype with several shortcomings, which include:

  • Programmatic access to some WPM counters (such as those of the Processor object) produces empty or raw metrics, so a metric such as CPU % Utilization cannot be read directly. What is required is a means of taking more than one reading over a defined period of time, after which the CPU utilization can be calculated. winsar does not contain this functionality, but similar agents such as NSClient and NC_Net (see Related topics) do provide for this.
  • Admittedly, using JMS as the transport for remote agents, while having some elegance, is limiting. NSClient and NC_Net both use a low-level but simple socket protocol to request and receive data. One of the original intents for these services was to provide Windows data to Nagios, a network-monitoring system almost exclusive to the Linux platform, so realistically there could be no Win32 APIs in the picture from the client side.
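The two-reading calculation mentioned in the first shortcoming can be sketched in a few lines. This is an illustration only: it assumes a raw counter that accumulates idle time in the same unit as the sample timestamps, which may not match how a particular WPM counter is defined, and the CpuUtilization class is hypothetical.

```java
/** Hypothetical calculator that derives CPU utilization from two raw readings. */
final class CpuUtilization {

    /** One raw reading: accumulated idle time and the time of the reading, in the same unit. */
    static final class Sample {
        final long idleTicks;
        final long timestampTicks;

        Sample(long idleTicks, long timestampTicks) {
            this.idleTicks = idleTicks;
            this.timestampTicks = timestampTicks;
        }
    }

    /** Returns the percentage of the interval between the samples that was not idle. */
    static double percentBusy(Sample first, Sample second) {
        long elapsed = second.timestampTicks - first.timestampTicks;
        if (elapsed <= 0) {
            throw new IllegalArgumentException("Samples must be in chronological order");
        }
        long idle = second.idleTicks - first.idleTicks;
        double busy = 100.0 * (1.0 - (double) idle / elapsed);
        // Clamp to guard against counter jitter pushing the result out of range.
        return Math.max(0.0, Math.min(100.0, busy));
    }
}
```

An agent that stays resident between requests can keep the previous sample in memory, which is one more argument for the long-running service mode over starting a process for each reading.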

Finally, as I mentioned before, the SpringCollector application bootstraps with a single parameter, which is the directory containing the configuration XML files. This directory is /conf/spring-collectors in the root of the sample code package. The specific files used in the preceding examples are:

  • shell-collectors.xml: Contains the definitions for all the shell collectors.
  • management.xml: Contains JMX management beans and the collection scheduler.
  • commandsets.xml: Contains the definitions for the shell collector's command sets. These reference external XML files in /commands.
  • shells.xml: Contains the definitions for all the shells.
  • jms.xml: Contains the definitions for the JMS connection factory and topics and a Java Naming and Directory Interface (JNDI) context.
  • groovy.xml: Contains the Groovy formatter bean.

This concludes my discussion of OS monitoring. Next, I'll cover the monitoring of database systems.

Monitoring database systems using JDBC

I have frequently encountered cultures that feel strongly that monitoring the database is the exclusive domain of the DBAs and their tools and applications. However, in pursuit of a nonsiloed and centralized APM repository for performance and availability data, it makes sense to complement the DBA efforts with some amount of monitoring from a consolidated APM. I will demonstrate some techniques using a Spring collector called JDBCCollector to gather data that in some cases is most likely not being monitored elsewhere and can usefully add to your arsenal of metrics.

The general categories of data collection that you should consider are:

  • Simple availability and response time: This is a simple mechanism to make a periodic connection to a database, issue one simple query and trace the response time, or trace a server-down metric if the connection fails. A failure in the connection may not necessarily indicate that the database is experiencing a hard down, but it clearly demonstrates, at the very least, communication issues from the application side. Siloed database monitoring may never indicate a database connectivity issue, but it is useful to remember that just because you can connect to a service from there does not mean that you can connect from here.
  • Contextual data: Revisiting the concept of contextual tracing from Part 1, you can leverage some useful information from periodic sampling of your application data. In many cases, there is a strong correlation between patterns of data activity in your database and the behavior or performance of your application infrastructure.
  • Database performance tables: Many databases expose internal performance and load metrics as tables or views, or through stored procedures. As such, the data is easily accessible through JDBC. This area clearly overlaps with traditional DBA monitoring, but database performance can usually be correlated so tightly to application performance that it is a colossal waste to collect these two sets of metrics and not correlate them through a consolidated system.

The JDBCCollector is fairly simple. At base, it is a query and a series of mapping statements that define how the results of the query are mapped to a tracing namespace. Consider the SQL in Listing 12:

Listing 12. A sample SQL query
SELECT schemaname, relname, 
SUM(seq_scan) as seq_scan, 
SUM(seq_tup_read) as seq_tup_read
FROM pg_stat_all_tables
where schemaname = 'public'
and seq_scan > 0
group by schemaname, relname
order by seq_tup_read desc

The query selects four columns from a table. I want to map each row returned from the query to a tracing namespace that consists in part of data in each row. Keep in mind that a namespace is made up of a series of segment names followed by a metric name. The mapping is defined by specifying these values using literals, row tokens, or both. A row token represents the value in a numbered column, such as {2}. When the segments and metric name are processed, literals are left as is while tokens are dynamically substituted with the respective column's value from the current row in the result of the query, as illustrated in Figure 9:

Figure 9. JDBCCollector mapping

In Figure 9, I'm representing a one-row response to the query, but the mapping procedure occurs once for each mapping defined for each row returned. The segment's value is {1},{2}, so the segment portion of the tracing namespace is {"public", "sales_order"}. The metric names are literals, so they stay the same, and the metric value is defined as 1 in the first mapping, and 2 in the second, representing 3459 and 16722637, respectively. A concrete implementation should clarify further.

Contextual tracing with JDBC

The application data in your operational database probably has useful and interesting contextual data. The application data is not necessarily performance-related itself, but when it's sampled and correlated with historical metrics that represent the performance of your Java classes, JVM health, and your server performance statistics, you can draw a clear picture of what the system was actually doing during a specific period. As a contrived example, suppose you are monitoring an extremely busy e-commerce Web site. Orders placed by your customers are logged in a table called sales_order along with a unique ID and the timestamp of when the order was placed. By sampling the number of records entered in the last n minutes, you can derive a rate at which orders are being submitted.

This is another place where the ITracer's delta capabilities are useful, because I can set up a JDBCCollector to query the number of rows since a specific time, and simply trace that value as a delta. The result is a metric (probably among many others) that depicts how busy your site is. This also becomes a valuable historical reference. For example, you might learn that when the rate of incoming orders hits 50 per cycle, your database starts slowing down. Hard and specific empirical data eases the process of capacity and growth planning.
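For readers who have not seen Part 1, the idea behind a delta metric can be illustrated with a few lines of Java. This is not the ITracer implementation; the DeltaTracker class is a hypothetical stand-in that shows only the bookkeeping involved.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;

/** Hypothetical illustration of delta tracing: reports the change since the previous reading. */
final class DeltaTracker {
    private final Map<String, Long> lastValues = new HashMap<>();

    /**
     * Stores the new reading and returns its difference from the previous one.
     * The first reading for a metric only establishes the baseline, so no delta is returned.
     */
    synchronized OptionalLong delta(String metricName, long value) {
        Long previous = lastValues.put(metricName, value);
        if (previous == null) {
            return OptionalLong.empty();
        }
        return OptionalLong.of(value - previous);
    }
}
```

The delta can be negative when the sampled value falls between two readings, so a consumer that expects only increases should decide how to treat that case.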

Now I'll implement this example. The JDBCCollector uses the same scheduler bean as the prior examples, and it also has a JDBC DataSource defined that's identical to the ones I covered in Part 2. These database collectors are defined in the /conf/spring-collectors/jdbc-collectors.xml file. Listing 13 shows the first collector:

Listing 13. Order-fulfillment rate collector
<bean name="OrderFulfilmentRateLast5s"
   class="org.runtimemonitoring.spring.collectors.jdbc.JDBCCollector">
   <property name="dataSource" ref="RuntimeDataSource" />
   <property name="scheduler" ref="CollectionScheduler" />
   <property name="query">
   <value><![CDATA[
   select count(*) from sales_order
   where order_date > ? and order_date < ?
   ]]></value>
   </property>
   <property name="logErrors" value="true" />
   <property name="tracingNameSpace" value="Database Context" />
   <property name="objectName" 
      value="org.runtime.db:name=OrderFulfilmentRateLast5s,type=JDBCCollector" />
   <property name="frequency" value="5000" />
   <property name="binds">
      <map>
         <entry><key><value>1</value></key><ref bean="RelativeTime"/></entry>
         <entry><key><value>2</value></key><ref bean="CurrentTime"/></entry>
      </map>
   </property>
   <property name="queryMaps">
      <!-- <set> is an assumption; use whichever collection element the property expects -->
      <set>
         <bean class="org.runtimemonitoring.spring.collectors.jdbc.QueryMap">
            <property name="valueColumn" value="0"/>
            <property name="segments" value="Sales Order Activity"/>
            <property name="metricName" value="Order Rate"/>
            <property name="metricType" value="SINT"/>
         </bean>
      </set>
   </property>
</bean>

<bean name="RelativeTime" 
   class="org.runtimemonitoring.spring.collectors.jdbc.RelativeTimeStampProvider">
   <property name="period" value="-5000" />
</bean>

<bean name="CurrentTime" 
   class="org.runtimemonitoring.spring.collectors.jdbc.RelativeTimeStampProvider">
   <property name="period" value="10" />
</bean>

The collector bean name in this case is OrderFulfilmentRateLast5s, and the class is org.runtimemonitoring.spring.collectors.jdbc.JDBCCollector. The standard collection scheduler is injected, as is a reference to the JDBC DataSource, RuntimeDataSource. The query defines the SQL that will be executed. SQL queries can either use literals as parameters, or, as in this example, use bind variables. This example is somewhat contrived because the two values for order_date could easily be expressed in SQL syntax, but typically a bind variable would be used when some external value needed to be supplied.

To provide the capability to bind external values, I need to implement the org.runtimemonitoring.spring.collectors.jdbc.IBindVariableProvider interface and then register that class as a Spring-managed bean. In this case, I am using two instances of org.runtimemonitoring.spring.collectors.jdbc.RelativeTimeStampProvider, a bean that supplies a current timestamp offset by the passed period property. These beans are RelativeTime, which returns the current time minus 5 seconds, and CurrentTime, which returns "now" plus 10 milliseconds. References to these beans are injected into the collector bean through the binds property, which is a map. It is critical that the key of each entry in the map match the position of the bind variable in the SQL statement it is intended for; otherwise, errors or unexpected results can occur.
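The mechanics can be sketched in plain Java. The signature of IBindVariableProvider is not shown in this article, so the sketch uses a hypothetical BindValueSource interface instead; OffsetTimestampSource and Binds are likewise illustrative names, and the map is keyed by the one-based JDBC parameter position on the assumption that this is what the keys in Listing 13 represent.

```java
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.Clock;
import java.util.Map;

/** Hypothetical analogue of a bind-variable provider; the real interface may differ. */
interface BindValueSource {
    Object currentValue();
}

/** Supplies the current time shifted by a period in milliseconds (negative means the past). */
final class OffsetTimestampSource implements BindValueSource {
    private final Clock clock;
    private final long periodMillis;

    OffsetTimestampSource(Clock clock, long periodMillis) {
        this.clock = clock;
        this.periodMillis = periodMillis;
    }

    @Override
    public Object currentValue() {
        return new Timestamp(clock.millis() + periodMillis);
    }
}

final class Binds {
    /** Sets each bind variable; the map key is the one-based parameter position in the SQL. */
    static void bindAll(PreparedStatement statement, Map<Integer, BindValueSource> binds)
            throws SQLException {
        for (Map.Entry<Integer, BindValueSource> bind : binds.entrySet()) {
            statement.setObject(bind.getKey(), bind.getValue().currentValue());
        }
    }
}
```

Passing a Clock into the constructor is a design choice made for this sketch so that the offset arithmetic can be checked against a fixed time.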

In effect, I am using these bind variables to capture the number of sales orders entered into the system in approximately the last five seconds. This is quite a lot of querying against a production table, so the frequency of the collection and the window of time (that is, the period of RelativeTime) should be adjusted to avoid exerting an uncomfortable load on the database. To assist in adjusting these settings correctly, the collector traces the collection time to the APM system so the elapsed time can be used as a measure of the query's overhead. More-advanced implementations of the collector could decay the frequency of collection as the elapsed time of the monitoring query increases.
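One possible shape for such a decaying frequency is sketched below. The sample collector does not do this; the AdaptiveFrequency class, the notion of an elapsed-time budget, and the linear scaling rule are all assumptions made for illustration.

```java
/** Hypothetical policy that lengthens the collection interval when the query gets slow. */
final class AdaptiveFrequency {
    private final long baseMillis;   // normal interval between collections
    private final long maxMillis;    // upper bound on the interval
    private final long budgetMillis; // elapsed query time regarded as acceptable

    AdaptiveFrequency(long baseMillis, long maxMillis, long budgetMillis) {
        if (budgetMillis <= 0) {
            throw new IllegalArgumentException("budgetMillis must be positive");
        }
        this.baseMillis = baseMillis;
        this.maxMillis = maxMillis;
        this.budgetMillis = budgetMillis;
    }

    /** Returns the delay before the next collection, given the last query's elapsed time. */
    long nextDelay(long lastElapsedMillis) {
        if (lastElapsedMillis <= budgetMillis) {
            return baseMillis;
        }
        // Scale the interval by how far the query overran its budget, up to the cap.
        double factor = (double) lastElapsedMillis / budgetMillis;
        return Math.min(maxMillis, Math.round(baseMillis * factor));
    }
}
```

With a base interval of 5000 ms and a budget of 100 ms, a query that takes 300 ms pushes the next collection out to 15000 ms.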

The mapping I presented above is defined through the queryMaps property using an inner bean of the type org.runtimemonitoring.spring.collectors.jdbc.QueryMap. It has four simple properties:

  • valueColumn: The zero-based index of the column in each row that should be bound as the tracing value. In this case, I am binding the value of count(*).
  • segments: The tracing namespace segment, which is defined as a single literal.
  • metricName: The metric name within the tracing namespace, also defined as a literal.
  • metricType: The ITracer metric type, which is defined as a sticky int.

The collector allows for multiple queryMaps to be defined per collector on the basis that you might want to trace more than one value from each executed query. The next example I'll show you uses rowTokens to inject values from the returned data into the tracing namespace, but the current example uses literals. However, to contrive an example using the same query, I could change the query to select count(*), 'Sales Order Activity', 'Order Rate' from sales_order where order_date > ? and order_date < ?. This makes my desired segment and metric names return in the query. To map them I can configure segments to be {1} and metricName to be {2}. In some stretch cases, the metricType might even come from the database, and that value can be represented by a rowToken too. Figure 10 displays the APM tree for these collected metrics:

Figure 10. Sales-order rate monitoring

Database performance monitoring

Using the same process, the JDBCCollector can acquire and trace performance data from database performance views. In this example, which uses PostgreSQL, these views (referred to as statistics views) have names prefixed with pg_stat. Many other databases have similar views that can be accessed with JDBC. For this example, I'll use the same busy e-commerce site and set up a JDBCCollector to monitor the table and index activity on the top five busiest tables. The collector definition, including the exact SQL, is shown in Listing 14:

Listing 14. Table and index activity monitor
<bean name="DatabaseIO"
   class="org.runtimemonitoring.spring.collectors.jdbc.JDBCCollector">
   <property name="dataSource" ref="RuntimeDataSource" />
   <property name="scheduler" ref="CollectionScheduler" />
   <property name="availabilityNameSpace"
      value="Database,Database Availability,Postgres,Runtime" />
   <property name="query">
   <value><![CDATA[
   SELECT schemaname, relname, SUM(seq_scan) as seq_scan,
   SUM(seq_tup_read) as seq_tup_read,
   SUM(idx_scan) as idx_scan, SUM(idx_tup_fetch) as idx_tup_fetch,
   COALESCE(idx_tup_fetch,0) + seq_tup_read
      + seq_scan + COALESCE(idx_scan, 0) as total
   FROM pg_stat_all_tables
   where schemaname = 'public'
   and (COALESCE(idx_tup_fetch,0) + seq_tup_read
      + seq_scan + COALESCE(idx_scan, 0)) > 0
   group by schemaname, relname, idx_tup_fetch,
      seq_tup_read, seq_scan, idx_scan
   order by total desc
   LIMIT 5 ]]></value>
   </property>
   <property name="tracingNameSpace"
     value="Database,Database Performance,Postgres,Runtime,Objects,{beanName}" />
   <property name="frequency" value="20000" />
   <property name="queryMaps">
   <!-- <set> is an assumption; use whichever collection element the property expects -->
   <set>
    <bean class="org.runtimemonitoring.spring.collectors.jdbc.QueryMap">
      <property name="valueColumn" value="2"/>
      <property name="segments" value="{0},{1}"/>
      <property name="metricName" value="SequentialScans"/>
      <property name="metricType" value="SDLONG"/>
    </bean>
    <bean class="org.runtimemonitoring.spring.collectors.jdbc.QueryMap">
      <property name="valueColumn" value="3"/>
      <property name="segments" value="{0},{1}"/>
      <property name="metricName" value="SequentialTupleReads"/>
      <property name="metricType" value="SDLONG"/>
    </bean>
    <bean class="org.runtimemonitoring.spring.collectors.jdbc.QueryMap">
      <property name="valueColumn" value="4"/>
      <property name="segments" value="{0},{1}"/>
      <property name="metricName" value="IndexScans"/>
      <property name="metricType" value="SDLONG"/>
    </bean>
    <bean class="org.runtimemonitoring.spring.collectors.jdbc.QueryMap">
      <property name="valueColumn" value="5"/>
      <property name="segments" value="{0},{1}"/>
      <property name="metricName" value="IndexTupleReads"/>
      <property name="metricType" value="SDLONG"/>
    </bean>
   </set>
   </property>
</bean>

The query retrieves the following values every 20 seconds for the top 5 busiest tables:

  • The name of the database schema
  • The name of the table
  • The total number of sequential scans
  • The total number of tuples retrieved by sequential scans
  • The total number of index scans
  • The total number of tuples retrieved by index scans

The last four columns are all perpetually increasing values, so I'm using a metric type of SDLONG, which is a sticky delta long. Note that in Listing 14 I have configured four QueryMaps to map these four columns into a tracing namespace.
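The token substitution that these QueryMaps rely on can be illustrated with a short sketch. This is not the collector's actual code: the RowTokenMapper class is hypothetical, and it assumes that tokens are zero-based column indexes, in line with the valueColumn description and the {0},{1} segments in Listing 14.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of resolving row tokens such as {0} against a result row. */
final class RowTokenMapper {
    private static final Pattern TOKEN = Pattern.compile("\\{(\\d+)\\}");

    /** Replaces each {n} token with the value of zero-based column n; literals pass through. */
    static String resolve(String template, Object[] row) {
        Matcher matcher = TOKEN.matcher(template);
        StringBuffer resolved = new StringBuffer();
        while (matcher.find()) {
            int column = Integer.parseInt(matcher.group(1));
            String value = String.valueOf(row[column]);
            matcher.appendReplacement(resolved, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(resolved);
        return resolved.toString();
    }

    /** Resolves a comma-separated segments definition into a list of namespace segments. */
    static List<String> segments(String definition, Object[] row) {
        List<String> result = new ArrayList<>();
        for (String part : definition.split(",")) {
            result.add(resolve(part.trim(), row));
        }
        return result;
    }
}
```

For a row that starts with public, sales_order, 3459, 16722637, the definition {0},{1} resolves to the segments public and sales_order, while a literal metric name such as SequentialScans is returned unchanged. The value to trace is then read from the column that valueColumn names.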

In this scenario, I have contrived a useful example without creating an index on the sales_order table. Consequently, monitoring will reveal a high number of sequential scans (also referred to as table scans in database parlance), which is an inefficient mechanism for retrieving data because it reads every row in the table. The same applies to the sequential tuple reads — basically the number of rows that have been read using a sequential scan. There is a significant distinction between rows and tuples, but it is not relevant in this situation. You can refer to the PostgreSQL documentation site for clarification (see Related topics). Looking at these statistics in the APM display, it is clear that my database is missing an index. This is illustrated in Figure 11:

Figure 11. Sequential reads

As soon as I notice this, I fire off a couple of SQL statements to index the table. The result shortly thereafter is that the sequential operations both drop down to zero, and the index operations that were zero are now active. This is illustrated in Figure 12:

Figure 12. After the index

The creation of the index has a rippling effect through the system. One of the other metrics that jumps out as visibly settled down is the User CPU % on the database host. This is illustrated in Figure 13:

Figure 13. CPU after the index

Database availability

The last JDBC aspect I'll address is the relatively simple one of availability. Database availability in its simplest form is an option in the standard JDBCCollector. If the collector is configured with an availabilityNameSpace value, the collector will trace two metrics to the configured namespace:

  • Availability: A 1 if the database could be connected to and a 0 if it could not
  • Connect time: The elapsed time consumed to acquire a connection

The connect time is usually extremely fast when a data source or connection pool is being used to acquire a connection. But most JDBC connection-pooling systems can execute a configurable SQL statement before the connection is handed out, so the test is legitimate. And under heavy load, connection acquisition can have a nonzero elapsed time. Alternatively, a separate data source can be set up for a JDBCCollector that is dedicated to availability testing. This separate data source can be configured not to pool connections at all, so every poll cycle initiates a new connection. Figure 14 displays the availability-check APM tree for my PostgreSQL runtime database. Refer to Listing 14 for an example of the use of the availabilityNameSpace property.
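The two availability metrics can be produced by a probe as small as the following sketch. It is not the JDBCCollector's implementation: the AvailabilityProbe class and the ConnectionSource interface are hypothetical, the latter standing in for a DataSource so that the sketch runs without a JDBC driver.

```java
/** Hypothetical stand-in for DataSource.getConnection(), so the sketch needs no driver. */
interface ConnectionSource {
    AutoCloseable open() throws Exception;
}

/** Hypothetical probe that yields the Availability and Connect time values. */
final class AvailabilityProbe {

    static final class Result {
        final int availability;   // 1 if a connection was acquired, 0 if not
        final long connectMillis; // elapsed time to acquire it, or -1 on failure

        Result(int availability, long connectMillis) {
            this.availability = availability;
            this.connectMillis = connectMillis;
        }
    }

    static Result check(ConnectionSource source) {
        long start = System.nanoTime();
        AutoCloseable connection;
        try {
            connection = source.open();
        } catch (Exception e) {
            return new Result(0, -1L);
        }
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
        try {
            connection.close();
        } catch (Exception ignored) {
            // A failure to close does not change the fact that the connection was acquired.
        }
        return new Result(1, elapsedMillis);
    }
}
```

Because java.sql.Connection is AutoCloseable, a real data source can be passed as AvailabilityProbe.check(dataSource::getConnection).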

Figure 14. The runtime database availability check

I have seen situations in which the determination of a specific status requires multiple chained queries. For example, a final status requires a query against Database A but requires parameters that can only be determined by a query against Database B. This condition can be handled with two JDBCCollectors with the following special considerations:

  • The chronologically first query (against Database B) is configured to be inert in that it has no schedule. (A collection frequency of zero means no schedule.) This instance of the JDBCCollector also implements IBindVariableProvider, meaning that it can supply the results of its query as bind variables to another collector.
  • The second collector defines the first collector as a bind that will bind in the results of the first query.

This concludes my discussion of database monitoring. I should add that this section has focused on database monitoring specifically through the JDBC interface. Complete monitoring of a typical database should also include monitoring of the OS the database resides on, the individual or groups of database processes, and some coverage of the network resources where necessary to access the database services.

Monitoring JMS and messaging systems

This section describes techniques to monitor the health and performance of a messaging service. Messaging services such as those that implement JMS — also referred to as message-oriented middleware (MOM) — play a crucial role in many applications. They require monitoring, like any other application dependency. Frequently, messaging services provide asynchronous, or "fire-and-forget," invocation points. Monitoring these points can be slightly more challenging because from many perspectives, the service can appear to be performing well, with calls to the service being dispatched frequently and with very low elapsed times. What can remain concealed is an upstream bottleneck where messages are being forwarded to their next destination either very slowly, or not at all.

Because most messaging services exist either within a JVM, or as one or more native processes on a host (or group of hosts), the monitoring points include some of the same points as for any targeted service. These might include standard JVM JMX attributes; monitoring resources on the supporting host; network responsiveness; and characteristics of the service's processes, such as memory size and CPU utilization.

I'll outline four categories of messaging-service monitoring, three of them being specific to JMS and one relating to a proprietary API:

  • In order to measure the throughput performance of a service, a collector periodically sends a group of synthetic test messages to the service and then waits for their return delivery. The elapsed time of the send, receive, and total round trip elapsed time is measured and traced, along with any failures or timeouts.
  • Many Java-based JMS products expose metrics and monitoring points through JMX, so I will briefly revisit implementing a JMX monitor using the Spring collector.
  • Some messaging services provide a proprietary API for messaging-broker management. These APIs typically include the ability to extract performance metrics of the running service.
  • In the absence of any of the preceding options, some useful metrics can be retrieved using standard JMS constructs such as a javax.jms.QueueBrowser.

Monitoring messaging services through synthetic test messages

The premise of synthetic messages is to schedule the sending and receiving of test messages to a target messaging service and measure the elapsed time of the messages' send, receive, and total round trip. To contrive the message's return, and also potentially to measure the response time of message delivery from a remote location, an optimal solution is to deploy a remote agent whose exclusive task is to:

  1. Listen for the central monitor's test messages
  2. Receive them
  3. Add a timestamp to each received message
  4. Resend them back to the messaging service for delivery back to the central monitor

The central monitor can then analyze the returned message and derive elapsed times for each hop in the process and trace them to the APM system. This is illustrated in Figure 15:
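The per-hop arithmetic is simple, as the following sketch shows. The RoundTripTiming class is hypothetical. Note that the outbound and return legs compare the monitor's clock with the agent's clock, so they are only meaningful if the two hosts' clocks are synchronized; the total uses the monitor's clock alone.

```java
/** Hypothetical holder for the elapsed times derived from one returned test message. */
final class RoundTripTiming {
    final long outboundMillis; // monitor send to agent timestamp
    final long returnMillis;   // agent timestamp to monitor receipt
    final long totalMillis;    // monitor send to monitor receipt

    private RoundTripTiming(long outboundMillis, long returnMillis, long totalMillis) {
        this.outboundMillis = outboundMillis;
        this.returnMillis = returnMillis;
        this.totalMillis = totalMillis;
    }

    /**
     * @param sentAt         time the monitor sent the message (monitor clock)
     * @param agentStampedAt timestamp the remote agent added (agent clock)
     * @param receivedAt     time the monitor received the message back (monitor clock)
     */
    static RoundTripTiming of(long sentAt, long agentStampedAt, long receivedAt) {
        return new RoundTripTiming(
                agentStampedAt - sentAt,
                receivedAt - agentStampedAt,
                receivedAt - sentAt);
    }
}
```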

Figure 15. Synthetic messages

Although this approach covers the most monitoring ground, it does have some downsides:

  • It requires the deployment and management of a remote agent.
  • It requires the creation of additional queues on the messaging services for the test message transmission.
  • Some messaging services allow for the dynamic creation of first-class queues on the fly, but many require manual queue creation through an administrative interface or through a management API.

An alternative that is outlined here as specific to JMS (but may have equivalents in other messaging systems) is the use of a temporary queue or topic. A temporary queue can be created on the fly through the standard JMS API, so no administrative intervention is required. These temporary constructs have the added advantage of being invisible to all other JMS participants except the originating creator.

In this scenario, I'll use a JMSCollector that creates a temporary queue on startup. When prompted by the scheduler, it sends a number of test messages to the temporary queue on the target JMS service and then receives them back again. This effectively tests the throughput on the JMS server and does not require the creation of concrete queues or the deployment of a remote agent. This is illustrated in Figure 16:

Figure 16. Synthetic messages with a temporary queue
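The send-then-receive cycle can be sketched without a JMS provider. In the following illustration, a java.util.concurrent.BlockingQueue stands in for the JMS temporary queue and byte arrays stand in for JMS messages. The class and method names are hypothetical and this is not the JMSCollector's actual code. In a real collector, the queue would typically come from javax.jms.Session.createTemporaryQueue(), and the receive would go through a JMS consumer with a timeout.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

final class SyntheticRoundTrip {

    /**
     * Sends every payload to the queue, then receives the same number of
     * messages back. Returns the elapsed time in nanoseconds, or -1 if a
     * receive timed out or the thread was interrupted.
     */
    static long measureNanos(BlockingQueue<byte[]> queue, List<byte[]> payloads, long timeoutMillis) {
        long start = System.nanoTime();
        try {
            for (byte[] payload : payloads) {
                queue.put(payload);                       // stand-in for the JMS send
            }
            for (int i = 0; i < payloads.size(); i++) {
                // stand-in for a JMS receive with a message timeout
                if (queue.poll(timeoutMillis, TimeUnit.MILLISECONDS) == null) {
                    return -1L;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();           // preserve the interrupt flag
            return -1L;
        }
        return System.nanoTime() - start;
    }
}
```

A timed-out cycle is reported separately from a slow one here (the -1 return value), so that an unavailable service is not traced as a very long elapsed time.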

The Spring collector class for this scenario is org.runtimemonitoring.spring.collectors.jms.JMSCollector. The configuration dependencies are fairly straightforward, and most of the dependencies are already set up from previous examples. The JMS connectivity requires a JMS javax.jms.ConnectionFactory. I acquire this using the same Spring bean that was defined to acquire a JMS connection factory in the Windows WPM collection example. As a recap, this required one instance of a Spring bean of type org.springframework.jndi.JndiTemplate that provides a JNDI connection to my target JMS service, and one instance of a Spring bean of type org.springframework.jndi.JndiObjectFactoryBean that uses the JNDI connection to look up the JMS connection factory.

To provide some flexibility in the makeup of the synthetic message payload, the JMSCollector is configured with a collection of implementations of an interface called org.runtimemonitoring.spring.collectors.jms.ISyntheticMessageFactory. Objects that implement this interface provide an array of test messages. The collector calls each configured factory and executes the round-trip test using the supplied messages. In this way, I can test throughput on my JMS service with payloads that vary by message size and message count.

Each ISyntheticMessageFactory has a configurable and arbitrary name that's used by the JMSCollector to add to the tracing name space. The full configuration is shown in Listing 15:

Listing 15. Synthetic message JMSCollector
<!-- The JNDI Provider -->
<bean id="jbossJndiTemplate" class="org.springframework.jndi.JndiTemplate">
   <property name="environment"><props>
      <prop key="java.naming.factory.initial">
      <prop key="java.naming.provider.url">
      <prop key="java.naming.factory.url.pkgs">

<!-- The JMS Connection Factory Provider -->
<bean id="RealJMSConnectionFactory"
   <property name="jndiTemplate" ref="jbossJndiTemplate" />
   <property name="jndiName" value="ConnectionFactory" />

<!-- A Set of Synthetic Message Factories -->
<bean id="MessageFactories" class="java.util.HashSet">
         <property name="name" value="MapMessage"/>
         <property name="messageCount" value="10"/>
         <constructor-arg type=""
         <property name="name" value="ByteMessage"/>
         <property name="messageCount" value="1"/>

<!-- The JMS Collector -->
<bean id="LocalJMSSyntheticMessageCollector"
   <property name="scheduler" ref="CollectionScheduler" />
   <property name="logErrors" value="true" />
   <property name="tracingNameSpace" value="JMS,Local,Synthetic Messages" />
   <property name="frequency" value="5000" />
   <property name="messageTimeOut" value="10000" />
   <property name="initialDelay" value="3000" />
   <property name="messageFactories" ref="MessageFactories"/>
   <property name="queueConnectionFactory" ref="RealJMSConnectionFactory"/>

The two message factories implemented in Listing 15 are:

  • A javax.jms.MapMessage factory that loads the current JVM's system properties into the payload of each message and is configured to send 10 messages per cycle
  • A javax.jms.BytesMessage factory that loads the bytes from a JAR file into the payload of each message and is configured to send 1 message per cycle

Figure 17 displays the APM tree for the synthetic-message monitoring. Note that the size of the byte payload is appended to the end of the javax.jms.BytesMessage message factory name.

Figure 17. APM tree for synthetic messages with a temporary queue

Monitoring messaging services through JMX

Messaging services such as JBossMQ and ActiveMQ expose a management interface through JMX. I introduced JMX-based monitoring in Part 1. I'll briefly revisit it now and introduce the Spring collector based on the org.runtimemonitoring.spring.collectors.jmx.JMXCollector class and how it can be used to monitor a JBossMQ instance. Because JMX is a consistent standard, the same process can be used to monitor any JMX-exposed metrics and is widely applicable.

The JMXCollector has two dependencies:

  • A javax.management.MBeanServerConnection for the local JBossMQ, provided by the bean named LocalRMIAdaptor. In this case, the connection is acquired by issuing a JNDI lookup for a JBoss org.jboss.jmx.adaptor.rmi.RMIAdaptor. Other providers are usually trivial to acquire, assuming that any applicable authentication credentials can be supplied, and the Spring package supplies a number of factory beans to acquire different implementations of MBeanServerConnections (see Related topics).
  • A profile of JMX collection attributes packaged in a collection bean containing instances of org.runtimemonitoring.spring.collectors.jmx.JMXCollection. These are directives to the JMXCollector about which attributes to collect.

The JMXCollection class exhibits some attributes common to JMX monitors. The basic configuration properties are:

  • targetObjectName: This is the full JMX ObjectName name of the MBean that is targeted for collection, but it can also be a wildcard. The collector queries the JMX agent for all MBeans matching the wildcard pattern and then collects data from each one.
  • segments: The segments of the APM tracing namespace where the collected metrics are traced.
  • metricNames: An array of metric names that each MBean attribute should be mapped to, or a single * character that directs the collector to use the attribute name provided by the MBean.
  • attributeNames: An array of MBean attribute names that should be collected from each targeted MBean.
  • metricTypes or defaultMetricType: An array of metric types that should be used for each attribute, or one single metric type that should be applied to all attributes.
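The way a wildcard targetObjectName and a list of attributeNames drive a collection can be sketched with the standard JMX API. This is a minimal illustration under stated assumptions, not the JMXCollector's actual code: the class and method names are hypothetical, and the example queries the running JVM's platform MBeanServer with the java.lang:type=MemoryPool,* pattern so that it runs without a JBossMQ instance.

```java
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.management.Attribute;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

final class WildcardCollection {

    /**
     * Queries the server for every MBean matching the pattern and reads the
     * requested attributes from each one. Keys are "<name property>/<attribute>".
     */
    static Map<String, Object> collect(MBeanServerConnection server,
                                       String targetObjectName,
                                       String... attributeNames) {
        Map<String, Object> readings = new LinkedHashMap<>();
        try {
            ObjectName pattern = new ObjectName(targetObjectName);
            // queryNames performs the discovery: one ObjectName per matching MBean
            for (ObjectName discovered : server.queryNames(pattern, null)) {
                for (Attribute attribute : server.getAttributes(discovered, attributeNames).asList()) {
                    readings.put(discovered.getKeyProperty("name") + "/" + attribute.getName(),
                                 attribute.getValue());
                }
            }
        } catch (Exception e) {
            throw new IllegalStateException("JMX collection failed for " + targetObjectName, e);
        }
        return readings;
    }

    public static void main(String[] args) {
        // The platform MBeanServer stands in for a remote MBeanServerConnection.
        collect(ManagementFactory.getPlatformMBeanServer(), "java.lang:type=MemoryPool,*", "Valid")
            .forEach((metric, value) -> System.out.println(metric + " = " + value));
    }
}
```

Because the pattern is evaluated on every call, MBeans that are registered after the collector starts are picked up on the next collection cycle.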

The MBean ObjectName wildcarding is a powerful option because it effectively implements discovery of monitoring targets rather than needing to configure the monitor for each individual target. In the case of JMS queues, JBossMQ creates a separate MBean for each queue, so if I want to monitor the number of messages in each queue (referred to as the queue depth), I can simply specify a general wildcard, such as the *:service=Queue,* pattern used in Listing 16, and all instances of JMS queue MBeans will be collected from. However, I have the additional challenge of dynamically figuring out what the queue name is, because these objects are discovered on the fly. In this case, I know that the value of the name property in each discovered MBean's ObjectName is the name of the queue. For example, a discovered MBean might have an object name whose key properties are service=Queue,name=MyQueue. Accordingly, I need a way of mapping properties from discovered objects to tracing namespaces in order to demarcate traced metrics from each source. This is achieved using tokens in similar fashion to the rowToken in the JDBCCollector. The supported tokens in the JMXCollector are:

  • {target-property:name}: The token is substituted with the named property from the target MBean's ObjectName. Example: {target-property:name}.
  • {this-property:name}: The token is substituted with the named property from the collector's ObjectName. Example: {this-property:type}.
  • {target-domain:index}: The token is substituted with the indexed segment of the target MBean's ObjectName domain. Example: {target-domain:2}.
  • {this-domain:index}: The token is substituted with the indexed segment of the collector's ObjectName domain. Example: {this-domain:0}.
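The token substitution described above can be sketched with javax.management.ObjectName. This is a hedged illustration rather than the JMXCollector's implementation: the class and method names are hypothetical, and it assumes that the "indexed segment" of a domain means the zero-based, dot-separated part of the domain string. Tokens that cannot be resolved are left in place.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

final class TokenResolver {
    // Matches {target-property:x}, {this-property:x}, {target-domain:n}, {this-domain:n}
    private static final Pattern TOKEN =
        Pattern.compile("\\{(target|this)-(property|domain):([^}]+)\\}");

    /** Convenience factory that converts the checked exception to an unchecked one. */
    static ObjectName objectName(String name) {
        try {
            return new ObjectName(name);
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Replaces each token in segments using the target MBean's or the collector's ObjectName. */
    static String resolve(String segments, ObjectName target, ObjectName self) {
        Matcher m = TOKEN.matcher(segments);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            ObjectName source = m.group(1).equals("target") ? target : self;
            String value = null;
            if (m.group(2).equals("property")) {
                value = source.getKeyProperty(m.group(3));   // null if the key is absent
            } else {
                try {
                    String[] parts = source.getDomain().split("\\.");  // assumed segment delimiter
                    int index = Integer.parseInt(m.group(3).trim());
                    if (index >= 0 && index < parts.length) {
                        value = parts[index];
                    }
                } catch (NumberFormatException ignored) {
                    // non-numeric index: leave the token unresolved
                }
            }
            // Unresolvable tokens are kept verbatim so the problem is visible in the APM tree
            m.appendReplacement(out, Matcher.quoteReplacement(value == null ? m.group() : value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With a target whose key properties include name=MyQueue, the segments string Destinations,Queues,{target-property:name} resolves to Destinations,Queues,MyQueue.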

Listing 16 shows the abbreviated XML configuration for the JBossMQ JMXCollector:

Listing 16. Local JBossMQ JMXCollector
<!-- The JBoss RMI MBeanServerConnection Provider -->
<bean id="LocalRMIAdaptor"
   <property name="jndiTemplate" ref="jbossJmxJndiTemplate" />
   <property name="jndiName" value="jmx/invoker/RMIAdaptor" />

<!-- The JBossMQ JMXCollection Profile -->
<bean id="StandardJBossJMSProfile"
   init-method="init" >
      <bean class="org.runtimemonitoring.spring.collectors.jmx.JMXCollection">
         <property name="targetObjectName" value="*:service=Queue,*"/>
         <property name="segments" value="Destinations,Queues,{target-property:name}"/>
         <property name="metricNames" value="*"/>
         <property name="attributeNames"
         <property name="defaultMetricType" value="SINT"/>
       <bean class="org.runtimemonitoring.spring.collectors.jmx.JMXCollection">
         <property name="targetObjectName" value=""/>
         <property name="segments" value="Destinations,{target-property:service}"/>
         <property name="metricNames" value="*"/>
         <property name="attributeNames" value="ClientCount"/>
         <property name="defaultMetricType" value="SINT"/>
       <!-- MBeans Also Included: Topics, ThreadPool, MessageCache -->

<!-- The JMXCollector for local JBoss MQ Server -->
<bean id="LocalJBossCollector"
   <property name="server" ref="LocalRMIAdaptor" />
   <property name="scheduler" ref="CollectionScheduler" />
   <property name="logErrors" value="true" />
   <property name="tracingNameSpace" value="JMS,Local" />
   <property name="objectName"
      value="org.runtime.jms:name=JMSQueueMonitor,type=JMXCollector" />
   <property name="frequency" value="10000" />
   <property name="initialDelay" value="10" />
   <property name="jmxCollections" ref="StandardJBossJMSProfile"/>

Figure 18 displays the APM tree for the JBossMQ server's JMS queues monitored using the JMXCollector:

Figure 18. APM tree for JMX monitoring of JBossMQ queues

Monitoring JMS queues using queue browsers

In the absence of an adequate management API for monitoring your JMS queues, it is possible to use a javax.jms.QueueBrowser. A queue browser behaves almost exactly like a javax.jms.QueueReceiver, except that acquired messages are not removed from the queue and are still delivered once retrieved by the browser. Queue depth is typically an important metric. It is commonly observed that in many messaging systems, message producers outpace message consumers. The severity of that imbalance can be viewed in the number of messages being queued in the broker. Consequently, if queue depths cannot be accessed in any other way, using a queue browser is a last resort. The technique has a number of downsides. In order to count the number of messages in a queue, the queue browser must retrieve every message in the queue (and then discard them). This is highly inefficient and will have a much higher elapsed time to collect than using a management API — and probably take a higher toll on the JMS server's resources. An additional aspect of queue browsing is that for busy systems, the count will most likely be wrong by the time the browse is complete. Having said that, for the purposes of monitoring an approximation is probably acceptable, and in a highly loaded system even a highly accurate measurement of a queue depth at any given instant will be obsolete in the next instant anyway.

Queue browsing has one benefit: In the course of browsing a queue's messages, the age of the oldest message can be determined. This is a difficult metric to come by, even with the best JMS-management APIs, and in some cases it can be a critical monitoring point. Consider a JMS queue used in the transmission of critical messages. The message producer and message consumer have typical differentials, and the pattern of traffic is such that a standard poll of the queue depth typically shows one or two messages. Ordinarily, this is due to a small amount of latency, but with a polling frequency of one minute, the messages in the queue are not the same messages from poll to poll. Or are they? They might not be the same ones, in which case the situation is normal. But it could be that both the message producer and message consumer simultaneously failed, and the couple of messages being observed in the queue are the same messages every single poll. In this scenario, monitoring the age of the oldest message in parallel with the queue depth makes the condition clear: normally the message ages are less than a few seconds, but if a double failure in the producer/consumer occurs, it will only take the time between two polling cycles for conspicuous data to start emerging from the APM.
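The depth and oldest-message-age calculation can be sketched in a few lines. In the following illustration, an Enumeration of send timestamps stands in for the messages a browser would return; in JMS, javax.jms.QueueBrowser.getEnumeration() supplies the messages and javax.jms.Message.getJMSTimestamp() supplies each send time. The class and field names are hypothetical, and this is not the JMSBrowserCollector's code.

```java
import java.util.Enumeration;

final class QueueBrowseStats {
    final int depth;             // number of messages seen during the browse
    final long oldestAgeMillis;  // age of the oldest message; 0 when the queue is empty

    private QueueBrowseStats(int depth, long oldestAgeMillis) {
        this.depth = depth;
        this.oldestAgeMillis = oldestAgeMillis;
    }

    /** Walks every timestamp once, counting entries and tracking the earliest one. */
    static QueueBrowseStats scan(Enumeration<Long> sendTimestamps, long nowMillis) {
        int depth = 0;
        long oldest = Long.MAX_VALUE;
        while (sendTimestamps.hasMoreElements()) {
            long timestamp = sendTimestamps.nextElement();
            depth++;
            oldest = Math.min(oldest, timestamp);
        }
        // Clamp at zero in case the broker's clock is slightly ahead of the monitor's
        long age = depth == 0 ? 0L : Math.max(0L, nowMillis - oldest);
        return new QueueBrowseStats(depth, age);
    }
}
```

Tracing both values on every poll makes the double-failure case visible: the depth stays flat while the oldest age grows by one polling interval each cycle.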

This functionality is demonstrated in the Spring collector's org.runtimemonitoring.spring.collectors.jms.JMSBrowserCollector. Its two additional configuration properties are a javax.jms.ConnectionFactory, just like the JMSCollector, and a collection of queues to browse. The configuration for this collector is shown in Listing 17:

Listing 17. Local JBossMQ JMSBrowserCollector
<!-- A collection of Queues to browse -->
<bean id="BrowserMonitorQueues" class="java.util.HashSet">
         <bean id="QueueA"
            <property name="jndiTemplate" ref="jbossJndiTemplate" />
            <property name="jndiName" value="queue/A" />
               <bean id="QueueB"
            <property name="jndiTemplate" ref="jbossJndiTemplate" />
            <property name="jndiName" value="queue/B" />

<!-- the JMS Queue Browser -->
<bean id="LocalQueueBrowserCollector"
   <property name="scheduler" ref="CollectionScheduler" />
   <property name="logErrors" value="true" />
   <property name="tracingNameSpace" value="JMS,Local,Queue Browsers" />
   <property name="frequency" value="5000" />
   <property name="initialDelay" value="3000" />
   <property name="queueConnectionFactory" ref="RealJMSConnectionFactory"/>
   <property name="queues" ref="BrowserMonitorQueues"/>

The APM tree for this collector is displayed in Figure 19:

Figure 19. APM tree for JMSBrowserCollector

As a testing mechanism, a load script ran in a loop, sending a few hundred messages to each queue on every iteration. In every iteration, a queue was picked at random to purge. Accordingly, the maximum message age in each queue varied randomly over time.

Monitoring messaging systems using proprietary APIs

Some messaging systems have a proprietary API for implementing management functions such as monitoring. Several of these use their own messaging system in a request/response pattern to submit management requests. ActiveMQ (see Related topics) provides a JMS messaging-management API as well as a JMX-management API. Implementing a proprietary management API requires a custom collector. In this section I'll present a collector for WebSphere® MQ (formerly referred to as MQ Series). The collector uses a combination of two APIs:

  • MS0B: WebSphere MQ Java classes for PCF: The PCF API is an administrative API for WebSphere MQ.
  • The core WebSphere MQ Java classes: An API formerly referred to as MA88 has been integrated into the core WebSphere MQ Java class library (see Related topics).

The use of the two APIs is redundant, but the example exhibits the use of two different proprietary APIs.

The Spring collector implementation, labeled MQCollector in Figure 20, monitors all queues on a WebSphere MQ server, gathering each one's queue depth and the number of current open input and output handles. The configuration for the collector is shown in Listing 18:

Listing 18. WebSphere MQ collector
<bean id="MQPCFAgentCollector"
   <property name="scheduler" ref="CollectionScheduler" />
   <property name="logErrors" value="true" />
   <property name="tracingNameSpace" value="MQ, Queues" />
   <property name="frequency" value="5000" />
   <property name="initialDelay" value="3000" />
   <property name="channel" value="SERVER1.QM2"/>
   <property name="host" value=""/>
   <property name="port" value="50002"/>

The unique configuration properties here are:

  • host: The IP address or host name of the WebSphere MQ server
  • port: The port that the WebSphere MQ process is listening on for connections
  • channel: The WebSphere MQ channel to connect to

Note that this example does not contain any authentication aspects.

Figure 20 displays the APM tree generated for the MQCollector:

Figure 20. APM tree for MQCollector

This concludes my discussion of messaging-service monitoring. As I promised earlier, I'll now cover monitoring with SNMP.

Monitoring using SNMP

SNMP was originally created as an application-layer protocol for exchanging information between network devices such as routers, firewalls, and switches. This is still probably its most commonly used function, but it also serves as a flexible and standardized protocol for performance and availability monitoring. The whole subject of SNMP and its implementation as a monitoring tool is larger than the summary scope of this article. However, SNMP has become so ubiquitous in the monitoring field that I would be remiss in not covering the topic to some extent.

One of the core structures in SNMP is the agent, which is responsible for brokering SNMP requests targeted at a specific device. The relative simplicity and low-level nature of SNMP makes it straightforward and efficient to embed an SNMP agent into a wide range of hardware devices and software services. Consequently, SNMP promises one standardized protocol to monitor the largest number of services in an application's ecosystem. In addition, SNMP is widely used to execute discovery by scanning a range of IP addresses and ports. From a monitoring perspective, this can save some administrative overhead by automatically populating and updating a centralized inventory of monitoring targets. In many respects, SNMP is a close analog to JMX. Despite some of the obvious differences, it is possible to draw several equivalencies between the two, and interoperability between JMX and SNMP is widely supported and implemented. Table 1 summarizes some of the equivalencies:

Table 1. SNMP and JMX comparison
SNMP structure | Equivalent JMX structure
Agent | Agent or MBeanServer
Manager | Client, MBeanServerConnection, Protocol Adapter
MIB | MBean, ObjectName, MBeanInfo
OID | ObjectName and ObjectName + Attribute name
Trap | JMX Notification
GET, SET | getAttribute, setAttribute

From a simple monitoring perspective, the critical factors I need to know when issuing an SNMP inquiry are:

  • Host address: The IP address or host name where the target SNMP agent resides.
  • Port: The port the target SNMP agent is listening on. Because a single network address may be fronting a number of SNMP agents, each one needs to listen on a different port.
  • Protocol version: The SNMP protocol has gone through a number of revisions and support levels vary by agent. Choices are: 1, 2c, and 3.
  • Community: The SNMP community is a loosely defined administrative domain. An SNMP client cannot issue requests against an agent unless the community is known, so it serves in part as a loose form of authentication.
  • OID: This is a unique identifier of a metric or group of metrics. The format is a series of dot-separated integers. For example, the SNMP OID for a Linux host's 1 Minute Load is . and the OID for the subset of metrics consisting of 1, 5, and 15 Minute Loads is .

Aside from the community, some agents can define additional layers of authentication.

Before I dive into SNMP APIs, note that SNMP metrics can be retrieved using two common command-line utilities: snmpget, which retrieves the value of one OID, and snmpwalk, which retrieves a subset of OID values. With this in mind, I can always extend my ShellCollector CommandSet to trace SNMP OID values. Listing 19 demonstrates an example of snmpwalk with raw and cleaned outputs retrieving the 1, 5, and 15 Minute Loads on a Linux host. I am using version 2c of the protocol and the public community.

Listing 19. Example of snmpwalk
$> snmpwalk -v 2c -c public .
UCD-SNMP-MIB::laLoad.1 = STRING: 0.22
UCD-SNMP-MIB::laLoad.2 = STRING: 0.27
UCD-SNMP-MIB::laLoad.3 = STRING: 0.26

$> snmpwalk -v 2c -c public . \
   | awk '{ split($1, name, "::"); print name[2] " " $4}'
laLoad.1 0.32
laLoad.2 0.23
laLoad.3 0.22

The second command can be easily represented in my Linux command set, as shown in Listing 20:

Listing 20. CommandSet for handling a snmpwalk command
<CommandSet name="LinuxSNMP">
      <shellCommand><![CDATA[snmpwalk -v 2c -c public
      . | awk '{ split($1, name, "::"); print name[2] "
      " $4}']]></shellCommand>
             <paragraph id="0" name="System Load Summary" header="false"/>
                <columns entryName="0" values="1" offset="0">
                   <namemapping from="laLoad.1" to="1 Minute Load"/>
                    <namemapping from="laLoad.2" to="5 Minute Load"/>
                    <namemapping from="laLoad.3" to="15 Minute Load"/>
                <tracers default="SINT"/>
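If a shell pipeline is not available, the same extraction that the awk expression and the name mappings perform can be done in Java on the raw snmpwalk output. The following is a minimal sketch, assuming output lines shaped like the raw output in Listing 19; the class and method names are hypothetical and are not part of the ShellCollector.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SnmpWalkParser {

    /**
     * Mirrors the awk expression: the metric name is the part of field 1 after
     * "::" and the value is field 4. Names found in nameMapping are replaced
     * with their friendly names; lines that do not fit the shape are skipped.
     */
    static Map<String, Double> parse(String output, Map<String, String> nameMapping) {
        Map<String, Double> metrics = new LinkedHashMap<>();
        for (String line : output.split("\\R")) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 4) {
                continue;                               // not a "name = TYPE: value" line
            }
            String[] nameParts = fields[0].split("::");
            if (nameParts.length < 2) {
                continue;                               // no MIB prefix to strip
            }
            String name = nameMapping.getOrDefault(nameParts[1], nameParts[1]);
            try {
                metrics.put(name, Double.parseDouble(fields[3]));
            } catch (NumberFormatException e) {
                // non-numeric value: skip rather than trace a bogus reading
            }
        }
        return metrics;
    }
}
```

Passing a mapping of laLoad.1 to 1 Minute Load reproduces the effect of the first namemapping element in Listing 20.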

There are several commercial and open source SNMP Java APIs. I have implemented a basic Spring collector called org.runtimemonitoring.spring.collectors.snmp.SNMPCollector, which uses an open source API called joeSNMP (see Related topics). The collector has the following critical configuration properties:

  • hostName: The IP address or host name of the target host.
  • port: The port number the target SNMP agent is listening on (defaults to 161).
  • targets: A set of SNMP OID targets made up of instances of org.runtimemonitoring.spring.collectors.snmp.SNMPCollection. The configuration properties for that bean are:
    • nameSpace: The tracing namespace suffix.
    • oid: The SNMP OID of the target metric.
  • protocol: The SNMP protocol: 0 for v1 and 1 for v2 (defaults to v1).
  • community: The SNMP community (defaults to public).
  • retries: The number of times to attempt the operation (defaults to 1).
  • timeOut: The timeout of the SNMP call in ms (defaults to 5000).

A sample configuration of an SNMPCollector set up to monitor my local JBoss application server is shown in Listing 21:

Listing 21. Configuration for the SNMPCollector
<!-- Defines the SNMP OIDs I want to collect and the mapped name -->
<bean id="JBossSNMPProfile" class="java.util.HashSet">
     <bean class="org.runtimemonitoring.spring.collectors.snmp.SNMPCollection">
             <property name="nameSpace" value="Free Memory"/>
             <property name="oid" value="."/>
     <bean class="org.runtimemonitoring.spring.collectors.snmp.SNMPCollection">
             <property name="nameSpace" value="Max Memory"/>
             <property name="oid" value="."/>
     <bean class="org.runtimemonitoring.spring.collectors.snmp.SNMPCollection">
             <property name="nameSpace" value="Thread Pool Queue Size"/>
             <property name="oid" value="."/>
     <bean class="org.runtimemonitoring.spring.collectors.snmp.SNMPCollection">
             <property name="nameSpace" value="TX Manager, Rollback Count"/>
             <property name="oid" value="."/>
     <bean class="org.runtimemonitoring.spring.collectors.snmp.SNMPCollection">
             <property name="nameSpace" value="TX Manager, Current Count"/>
             <property name="oid" value="."/>

<!-- Configures an SNMP collector for my local JBoss Server -->
<bean id="LocalJBossSNMP"
   <property name="scheduler" ref="CollectionScheduler" />
   <property name="logErrors" value="true" />
   <property name="tracingNameSpace" value="Local,JBoss,SNMP" />
   <property name="frequency" value="5000" />
   <property name="initialDelay" value="3000" />
   <property name="hostName" value="localhost"/>
   <property name="port" value="1161"/>
   <property name="targets" ref="JBossSNMPProfile"/>

The collector does have some shortcomings in that the configuration is somewhat verbose, and the runtime is inefficient because it makes one call per OID instead of bulk collecting. The snmp-collectors.xml file in the sample code (see Download) also contains an example SNMP collector configuration for Linux server monitoring. Figure 21 displays the APM system metric tree:

Figure 21. APM tree for SNMPCollector

At this stage, you probably get the idea of how to create collectors. Full coverage of an environment may require several different types of collectors, and this article's source code contains additional examples of collectors for other monitoring targets if you are interested. They are all in the org.runtimemonitoring.spring.collectors package. These are summarized in Table 2:

Table 2. Additional collector examples
Collection target | Class
Web services: Checks the response time and availability of secure Web services | webservice.MutualAuthWSClient
Web services and URL checks | webservice.NoAuthWSClient
Apache Web Server, performance and availability | apacheweb.ApacheModStatusCollector
NTop: A utility for collecting detailed network statistics | network.NTopHostTrafficCollector

Data management

One of the most complex challenges of a large and busy availability and performance data gathering system is the efficient persistence of the gathered data to a common metrics database. The considerations for the database and the persistence mechanism are:

  • The metrics database must support reasonably fast and simple querying of historical metrics for the generation of data visualizations, reporting, and analysis.
  • The metrics database must retain history and granularity of data to support the time windows, accuracy, and the required precision of reporting.
  • The persistence mechanism must perform sufficiently well and concurrently to avoid affecting the liveliness of the front-end monitoring.
  • The retrieval and storage of metric data must be able to run concurrently without one having a negative impact on the other.
  • Requests for data from the database should be able to support aggregation over periods of time.
  • The data in the database should be stored in a way that allows the retrieval of data in a time-series pattern or some mechanism that guarantees that multiple data points are significantly correlated if associated with the same effective time period.

In view of these considerations, you need:

  • A good-performing and scalable database with a lot of disk space.
  • A database with efficient search algorithms. Essentially, because metrics will be stored by compound name, one solution is to store the compound name as a string and use some form of pattern matching in order to specify the target query metrics. Many relational databases have support for regex built into the SQL syntax, which is ideal for querying by compound names but tends to be slow because it typically precludes the use of indexes. However, many relational databases also support functional indexes that could be used to speed up queries when using a pattern matching search. Another alternative is to completely normalize the data and break out the individual segments of the compound name (see Normalized vs. flat database structure below).
  • A strategy for limiting the number of writes and the total data volume written to the database is to implement a series of tiered aggregation and conflation. This can be done before the data is written to the database by keeping a rollup buffer of metrics. For a fixed period of time, you write all metrics tagged for storage to an accumulating buffer, which keeps the metric effective start time and end time, and the average, minimum, and maximum values of the metric for that period. In this way, metric values are conflated and aggregated before being written to the database. For example, if a collector's frequency is 15 seconds, and the aggregation period is 1 minute, 4 individual metrics will be conflated to 1. This assumes some tolerance for loss of granularity in persisted data, so there's a trade-off between lower granularity with lower data volume and higher granularity at the cost of higher data volume. At the same time, real-time visualization of metrics can be achieved by rendering graphics from preaggregation in-memory circular buffers so only persisted data is aggregated. Of course, you can be selective about which metrics actually need to be persisted.
  • An additional strategy for reducing data volume stored is to implement a sampling frequency where only x out of every y raw metrics are stored. Again, this reduces granularity of persisted data but will use fewer resources (especially memory) than maintaining aggregation buffers.
  • Metric data can also be rolled up and purged in the metric database after persistence. Again, within the tolerance for loss of granularity, you can roll up periods of data in the database into summary tables for each hour, day, week, month, and so on; purge the raw data; and still retain reasonable and useful sets of metric data.
  • In order to avoid affecting the data collectors themselves through the activity of the data-persistence process, it is critical to implement noninvasive background threads that can flush discrete collections of metric data to a data-storage process.
  • It can be challenging to create accurate time-series-like reports after the fact from multiple data sets that do not have the same effective time window. Consider the x axis of a graph, which represents time, and the y axis, which represents the value of a specific metric and multiple series (lines or plots) that represent readings for the same metric from a set of different sources. If the effective timestamp of the readings for each series is significantly different, the data must be massaged to keep the graph representation valid. This can be done by aggregating the data up to the lowest common consistent time window between all the plotted series, but again, this loses granularity. A much simpler solution is to maintain a consistent effective timestamp across all metric values. This is a feature of time-series databases. A commonly used one is RRDTool, which effectively enforces consistent and evenly spaced timestamps across different data values in a series. In order to keep effective timestamps consistent in a relational database, a good strategy is to take samplings of all metrics in line with a single uniform scheduler. For example, a single timer might fire off every two minutes, resulting in a single "swipe" of all metrics captured at that time, and all are tagged with the same timestamp and then queued for persistence at the persistor's leisure.
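The rollup buffer described in the list above can be sketched as a small accumulator. This is a minimal illustration with hypothetical class and method names, and it assumes long-valued metrics: it keeps the effective start and end time along with the minimum, maximum, and average for one aggregation period.

```java
final class MetricRollup {
    private final long startMillis;   // effective start of the aggregation period
    private long endMillis;           // set when the period is closed
    private long count;
    private long sum;
    private long min = Long.MAX_VALUE;
    private long max = Long.MIN_VALUE;

    MetricRollup(long startMillis) {
        this.startMillis = startMillis;
    }

    /** Conflates one raw reading into the running aggregate. */
    void add(long value) {
        count++;
        sum += value;
        min = Math.min(min, value);
        max = Math.max(max, value);
    }

    /** Marks the end of the period; the rollup is then ready to be persisted. */
    void close(long endMillis) {
        this.endMillis = endMillis;
    }

    long startMillis() { return startMillis; }
    long endMillis()   { return endMillis; }
    long count()       { return count; }
    long min()         { return count == 0 ? 0 : min; }
    long max()         { return count == 0 ? 0 : max; }
    long average()     { return count == 0 ? 0 : sum / count; }
}
```

With a collector frequency of 15 seconds and an aggregation period of 1 minute, four calls to add() are followed by one close(), so four raw readings become a single stored record.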

Obviously, it is challenging to address each and every one of these issues in a completely optimal way, and compromises must be considered. Figure 22 illustrates a conceptual data-persistence flow for a data collector:

Figure 22. Data-management flow

Figure 22 represents the data flow from a conceptual data collector like the Spring collectors presented in this article. The objective is to push individual-metric trace measurements through a series of tiers until they are persisted. The process is:

  1. The data collector is optionally configured for metric persistence with the following properties:
    • Persistence enabled: True or false.
    • Metric name filter: A regular expression. Metric names that match the regular expression are marked for persistence beyond the historical cache.
    • Sampling count: The number of metric collections that should skip persistence for every one that is persisted. A value of 0 means no skips.
    • Historical cache size: The number of metrics traced that should be stored in cache. For real-time visualization purposes, most metrics should be enabled for historical cache even if not marked for persistence so that they can be rendered into real-time charts.
  2. The collector cache is a number of discrete metric readings equal to the number of distinct metrics being generated by the collector. The cache is first in, first out (FIFO), so when the historical cache reaches full size, the oldest metrics are discarded to make room for the newest. The cache supports the registration of cache event listeners that can be notified of individual metrics being added and dropped from the cache.
  3. The collector rollup is a series of two or more circular caches of one instance of each metric being generated by the collector. As each cache instance in the circular buffer becomes active, it registers new trace events for historical cache and aggregates each incoming new value, effectively conflating the stream of data for a given period. The rollup contains:
    • The start and end time of the period. (The end time isn't set until the end of the period.)
    • The minimum, maximum, and average reading for the period.
    • The metric type.
  4. A central timer fires a flush event at the end of every period. When the timer fires, the collector rollup circular buffer index is incremented, and historical cache events are delivered to the next cache in the buffer. Then each aggregated rollup in the "closed" buffer has a common timestamp applied to it and is written to a persistence queue waiting to be stored to the database.

    Note that when the circular buffer increments, the minimum, maximum, and average values in the next buffer element will be zero, except for sticky metric types.

  5. A pool of threads in the persistence thread pool reads the rollup items from the persistence queue and writes them to the database.
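The persistence properties in step 1 can be illustrated with a small sketch. The class below is hypothetical (PersistenceFilter and its method names are not part of the article's code), and it assumes the sampling count is tracked per metric name rather than per collection pass:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical helper: decides whether a traced reading goes on to persistence. */
final class PersistenceFilter {
    private final boolean enabled;
    private final Pattern nameFilter;
    private final int samplingCount; // readings skipped per persisted reading; 0 = no skips
    private final Map<String, Integer> skipped = new HashMap<>();

    PersistenceFilter(boolean enabled, String nameRegex, int samplingCount) {
        this.enabled = enabled;
        this.nameFilter = Pattern.compile(nameRegex);
        this.samplingCount = samplingCount;
    }

    /** Returns true when this reading of the named metric should be persisted. */
    synchronized boolean shouldPersist(String metricName) {
        if (!enabled || !nameFilter.matcher(metricName).matches()) {
            return false; // stays in the historical cache only
        }
        // Assumption: the first matching reading is persisted, then samplingCount are skipped.
        int count = skipped.getOrDefault(metricName, samplingCount);
        if (count >= samplingCount) {
            skipped.put(metricName, 0);
            return true;
        }
        skipped.put(metricName, count + 1);
        return false;
    }
}
```

With a sampling count of 2, for example, this sketch persists the first, fourth, and seventh readings of each matching metric, while every reading can still be added to the historical cache for real-time charts.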

The length of the aggregation period during which the same collector rollups are being conflated is determined by the frequency of the central flush timer. The longer the period, the less data is written to the database, at the cost of data granularity.
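As a rough illustration of the rollup and flush steps, here is a minimal single-metric sketch. RollupBuffer and its nested Rollup class are hypothetical names, the slot layout is an assumption, and sticky metric types are ignored for brevity:

```java
/** Hypothetical circular rollup buffer for a single metric. */
final class RollupBuffer {
    /** Immutable snapshot of one closed period, ready for the persistence queue. */
    static final class Rollup {
        final long start, end, min, max;
        final double avg;
        final int count;
        Rollup(long start, long end, long min, long max, double avg, int count) {
            this.start = start; this.end = end; this.min = min;
            this.max = max; this.avg = avg; this.count = count;
        }
    }

    private final long[] start, min, max, sum;
    private final int[] count;
    private int active = 0;

    RollupBuffer(int slots, long now) {
        if (slots < 2) throw new IllegalArgumentException("need two or more slots");
        start = new long[slots]; min = new long[slots];
        max = new long[slots]; sum = new long[slots]; count = new int[slots];
        reset(0, now);
    }

    /** Zeroes a slot before it becomes active (non-sticky behavior). */
    private void reset(int i, long now) {
        start[i] = now; min[i] = 0; max[i] = 0; sum[i] = 0; count[i] = 0;
    }

    /** Conflates one incoming reading into the active slot. */
    synchronized void trace(long value) {
        int i = active;
        min[i] = count[i] == 0 ? value : Math.min(min[i], value);
        max[i] = count[i] == 0 ? value : Math.max(max[i], value);
        sum[i] += value;
        count[i]++;
    }

    /** Called by the flush timer: advances the index and returns the closed period. */
    synchronized Rollup flush(long now) {
        int closed = active;
        active = (active + 1) % start.length;
        reset(active, now);
        double avg = count[closed] == 0 ? 0.0 : (double) sum[closed] / count[closed];
        return new Rollup(start[closed], now, min[closed], max[closed], avg, count[closed]);
    }
}
```

Each call to flush produces exactly one row per metric for the period, which is why a longer flush interval reduces database writes but hides variation inside the period.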

Normalized vs. flat database structure

Several specialized databases, such as RRDTool, might be considered for metric storage, but relational databases are used frequently. When you implement metric storage in a relational database, you have two general options to consider: normalizing the data or keeping a flatter structure. This is mostly applicable to how the compound metric names are stored; all other reference data should probably be normalized in line with your data-modeling best practices.

The benefit of a completely normalized database structure is that by breaking out the metric compound names into their individual segments, virtually any database can take advantage of indexing to speed up queries. Normalized structures also store less redundant data, which in turn leads to a smaller database, higher density of data, and better performance. The downside is the complexity and size of the queries: selecting even a relatively simple compound metric name, or a pattern of metric names (for example, % User CPU on hosts 11 through 93), requires SQL that contains several joins and many predicates. This can be mitigated through the use of database views and the hard storage of common metric-name decodes. The storage of each individual metric requires the decomposition of the compound name to locate the metric reference data item (or create a new one), but this performance overhead can be mitigated by caching all the reference data in the persistence process.

Figure 23 shows a model for storing compound metrics in a normalized structure:

Figure 23. A normalized model for metric storage
A normalized model for metric storage

In Figure 23, each individual unique compound metric is assigned a unique METRIC_ID, and each individual unique segment is assigned a SEGMENT_ID. Then compound names are built in an associative entity that contains the METRIC_ID, SEGMENT_ID, and the sequence that the segments appear in. The metric types are stored in the reference table METRIC_TYPE, and the metric values themselves are stored in METRIC_INSTANCE with value, timestamp start and end properties, and then references to the metric type and the unique METRIC_ID.
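To make the decomposition concrete, the following sketch caches the reference data in memory, as suggested above for the persistence process. MetricReferenceCache is a hypothetical class, and it assumes segments are delimited by a forward slash; a real implementation would insert rows into the reference tables on each cache miss instead of only assigning IDs locally:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory cache of METRIC_ID and SEGMENT_ID reference data. */
final class MetricReferenceCache {
    private final Map<String, Integer> segmentIds = new HashMap<>();
    private final Map<String, Integer> metricIds = new HashMap<>();
    // METRIC_ID -> ordered SEGMENT_IDs; the list index is the segment sequence
    private final Map<Integer, List<Integer>> metricSegments = new HashMap<>();

    /** Returns the ID for a compound name, assigning new IDs on first sight. */
    synchronized int metricIdFor(String compoundName) {
        Integer known = metricIds.get(compoundName);
        if (known != null) {
            return known; // cache hit: no decomposition or lookup needed
        }
        List<Integer> ids = new ArrayList<>();
        for (String segment : compoundName.split("/")) {
            Integer segmentId = segmentIds.get(segment);
            if (segmentId == null) {
                segmentId = segmentIds.size() + 1; // stand-in for a database sequence
                segmentIds.put(segment, segmentId);
            }
            ids.add(segmentId);
        }
        int metricId = metricIds.size() + 1;
        metricIds.put(compoundName, metricId);
        metricSegments.put(metricId, Collections.unmodifiableList(ids));
        return metricId;
    }

    /** Returns the ordered segment IDs that make up a metric's compound name. */
    synchronized List<Integer> segmentIdsFor(int metricId) {
        return metricSegments.get(metricId);
    }
}
```

Two metrics that differ only in the host segment share every other SEGMENT_ID, which is where the reduction in redundant data comes from.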

On the other hand, a flat model is compelling in its simplicity, as illustrated in Figure 24:

Figure 24. A flat model for metric storage
A flat model for metric storage

In this case, I have separated out the metric name from the compound name, and the remaining segments are retained in their original pattern in the segments column. Again, if the database engine you use can run well-performing queries on wide text columns with pattern-based predicates such as regular expressions, this model has the virtue of being simple to query against. This aspect should not be undervalued. The building of data-analysis, visualization, and reporting tools is significantly streamlined with a simplified query structure, and speed of query writing during an emergency triage session is possibly the most important aspect of all!

If you need persistent data storage, picking the right database is a crucial step. Using a relational database is workable, provided it performs well enough and you can extract data from it formatted to your needs. The generation of time-series-driven data can be challenging, but through correct grouping and aggregation — and the use of additional tools such as JFreeChart (see Related topics) — you can generate good representative reports and graphics. If you elect instead to implement a more specialized database such as RRDTool, be prepared to go the long way around when it comes to extracts and reports after the fact. If the database does not support standards such as ODBC and JDBC, this will exclude the use of commonly available and standardized reporting tools.

This concludes my discussion of data management. This article's final section presents techniques for visualizing data in real time.

Visualization and reporting

At some point, you will have implemented your instrumentation and performance data collectors, and data will be streaming into your APM system. The next logical step is to see a visual representation of that data in real time. I use the term real time loosely here to mean that visualizations represent data that was collected very recently.

The commonly used term for data visualizations is dashboard. Dashboards can present virtually any aspect of data pattern or activity that you can think of, and they are limited only by the quality and quantity of the data being collected. In essence, a dashboard tells you what's going on in your ecosystem. One of the real powers of dashboards in APM systems is the capability to represent vastly heterogeneous data (that is, data collected from different sources) in one uniform display. For example, one display can simultaneously show recent and current trends in CPU usage on the database, network activity between the application servers, and the number of users currently logged into your application. In this section, I'll present different styles of data visualization and an example implementation of a visualization layer for the Spring collector I presented earlier in this article.

The premises of the Spring collector visualization layer are:

  • An instance of a cache class is deployed as a Spring bean. The cache class is configurable to retain any number of historical ITracer traces but is fixed-size and FIFO. A cache might be configured with a history size of 50, meaning that once the cache is fully populated, it retains the last 50 traces.
  • Spring collectors are configured in the Spring XML configuration file with a cacheConfiguration that wraps the instance of ITracer with a caching process. The configuration also associates the collector with the defined cache instance. Collected traces are processed as they normally are, but are also added to the cache instance associated with the collector. Using the preceding example, if the cache has a history size of 50 and the collector collects data every 30 seconds, the cache, when fully populated, retains the last 25 minutes of all traces collected by the collector.
  • The Spring collector instance has a number of rendering classes deployed. Renderers are classes that implement the org.runtimemonitoring.spring.rendering.IRenderer interface. Their job is to acquire arrays of data from the caches and render some form of visualization from that data. Periodic retrieval of the visualization media from a renderer generates fresh and up-to-date presentations, or as up to date as the cache data is.
  • The rendered content can then be delivered to a client within the context of a dashboard such as a Web browser or some form of rich client.
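The fixed-size FIFO behavior described in the first two premises can be sketched as follows. FifoTraceCache is a hypothetical stand-in, not the article's TraceCacheStore, and it stores plain long values rather than full trace objects to stay short:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical fixed-size FIFO cache of trace values with event listeners. */
final class FifoTraceCache {
    /** Notified when a value enters the cache or is pushed out of it. */
    interface Listener {
        void added(long value);
        void dropped(long value);
    }

    private final int historySize;
    private final Deque<Long> traces = new ArrayDeque<>();
    private final List<Listener> listeners = new CopyOnWriteArrayList<>();

    FifoTraceCache(int historySize) {
        if (historySize < 1) throw new IllegalArgumentException("historySize must be positive");
        this.historySize = historySize;
    }

    void addListener(Listener listener) {
        listeners.add(listener);
    }

    /** Adds a value, discarding the oldest one once the cache is full. */
    void add(long value) {
        Long dropped = null;
        synchronized (traces) {
            if (traces.size() == historySize) {
                dropped = traces.removeFirst();
            }
            traces.addLast(value);
        }
        for (Listener listener : listeners) {
            if (dropped != null) listener.dropped(dropped);
            listener.added(value);
        }
    }

    /** Returns a copy of the cached values, oldest first. */
    List<Long> snapshot() {
        synchronized (traces) {
            return new ArrayList<>(traces);
        }
    }

    /** Time window covered by a full cache: history size times collection frequency. */
    static long retainedSeconds(int historySize, int frequencySeconds) {
        return (long) historySize * frequencySeconds;
    }
}
```

For the example above, retainedSeconds(50, 30) gives 1,500 seconds, which is the 25 minutes of history mentioned in the second premise.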

Figure 25 outlines this process:

Figure 25. Caching and rendering
Caching and rendering

The cache implementation in this example is org.runtimemonitoring.spring.collectors.cache.TraceCacheStore. Other objects can register to be cache event listeners, so among other events, the renderers can listen on new cache item events indicating a new value has been added to the cache. In this way, the renderers can actually cache the content they generate but invalidate the cache when new data is available. The content from the renderers is delivered to client dashboards through a servlet called org.runtimemonitoring.spring.rendering.MediaServlet. The servlet parses the requests it receives, locates the renderer, and requests the content (all content is rendered and delivered as a byte array) and the content's MIME type. The byte array is then streamed to the client along with the MIME type so the client can interpret the stream. Serving graphical content from a URL-based service is ideal, because it can be consumed by Web browsers, rich clients, and everything between. When the renderers receive a request for content from the media servlet, the content is delivered from cache unless the cache has been marked dirty by a cache event. In this way, the renderers do not need to regenerate their content on every request.
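The dirty-flag caching that the renderers perform can be sketched in a few lines. CachedContent is a hypothetical class, and the Supplier stands in for whatever rendering call a real renderer would make:

```java
import java.util.function.Supplier;

/** Hypothetical holder for rendered content that regenerates only when marked dirty. */
final class CachedContent {
    private final Supplier<byte[]> renderer; // stand-in for the real rendering call
    private byte[] content;
    private boolean dirty = true; // nothing has been rendered yet

    CachedContent(Supplier<byte[]> renderer) {
        this.renderer = renderer;
    }

    /** Called from a cache event listener when new data arrives. */
    synchronized void markDirty() {
        dirty = true;
    }

    /** Returns cached bytes, rendering again only if new data has arrived since. */
    synchronized byte[] get() {
        if (dirty) {
            content = renderer.get();
            dirty = false;
        }
        return content;
    }
}
```

With a collector running every 15 seconds, a dashboard polling once a second would trigger at most one render per collection under this scheme, regardless of how many clients request the image.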

Generating, caching, and delivering visual media in byte-array format is useful because it is the lowest common denominator, and most clients can reconstitute the content when provided the MIME type. Because this implementation caches generated content in memory, and total memory consumption adds up significantly for a lot of cached content, I use a compression scheme. Once again, if the compression-algorithm symbol is provided with the content, most clients can decompress it. Most contemporary browsers, for example, support gzip decompression. However, reasonable compression levels are not especially high (I'm seeing from 30 to 40 percent on larger images), so rendering implementations can instead cache to disk or, if disk access costs more, regenerate content on the fly, which might be less resource-intensive.
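A minimal sketch of gzip-compressing rendered content with the JDK's java.util.zip classes follows. ContentCompressor is a hypothetical name, and the sketch assumes Java 9 or later for InputStream.readAllBytes:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper for compressing cached renderer content. */
final class ContentCompressor {
    private ContentCompressor() {}

    /** Compresses a byte array with gzip. */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray(); // read after close so the gzip trailer is included
    }

    /** Restores the original bytes from gzip-compressed content. */
    static byte[] gunzip(byte[] compressed) {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return gz.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Percentage of the raw size saved by compression. */
    static double savedPercent(int rawLength, int compressedLength) {
        return 100.0 * (rawLength - compressedLength) / rawLength;
    }
}
```

Formats such as PNG are already compressed internally, which is one plausible reason the savings on images are modest; measuring savedPercent on real renderer output is the simplest way to decide whether in-memory compression is worth the CPU cost.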

A specific example will be useful here. I set up two Apache Web Server collectors to monitor the number of busy worker threads. Each collector has an assigned cache, and I set up a small number of renderers to provide charts to display the number of busy workers on each server. In this case, the renderer generates a PNG file displaying a time-series line graph with series for both servers. The collector and cache setup for one server is shown in Listing 22:

Listing 22. An Apache Web Server collector and cache
<!-- The Apache Collector -->
<!-- The collector bean's class attribute is elided in this listing -->
<bean id="Apache2-AP02" class="...">
   <property name="scheduler" ref="CollectionScheduler" />
   <property name="logErrors" value="true" />
   <property name="tracingNameSpace" value="WebServers,Apache" />
   <property name="frequency" value="15000" />
   <property name="initialDelay" value="3000" />
   <property name="modStatusURL" value="http://WebAP02/server-status?auto" />
   <property name="name" value="Apache2-AP02" />
   <property name="cacheConfiguration">
      <!-- The inner configuration bean's class attribute is also elided -->
      <bean class="...">
         <property name="cacheStores" ref="Apache2-AP02-Cache"/>
      </bean>
   </property>
</bean>

<!-- The Apache Collector Cache -->
<bean id="Apache2-AP02-Cache"
   class="org.runtimemonitoring.spring.collectors.cache.TraceCacheStore">
   <constructor-arg value="50"/>
</bean>

Note the cacheConfiguration property in the collector and how it references the cache object called Apache2-AP02-Cache.

I also set up a renderer that is an instance of org.runtimemonitoring.spring.rendering.GroovyRenderer. This renderer delegates all rendering to an underlying Groovy script on the file system. This is ideal, because I can tweak it at run time to fine-tune details of the generated graphic. This renderer's general properties are:

  • groovyRenderer: A reference to a org.runtimemonitoring.spring.groovy.GroovyScriptManager, which is configured to load a Groovy script from a directory. This is the same class I used to massage data returned from the Telnet session to my Cisco CSS.
  • dataCaches: A set of caches that the renderer requests data from and renders. The renderer also registers to receive events from the caches when they add new items. When it does, it marks its content as dirty, and it is regenerated on the next request.
  • renderingProperties: Default properties passed to the renderer that direct specific details of the generated graphic, such as the image's default size. As you'll see below, these properties can be overridden by the client request.
  • metricLocatorFilters: A collector cache contains cached traces for every metric generated by the collector. This property allows you to specify an array of regular expressions to filter down which metrics you want.

The renderer setup is shown in Listing 23:

Listing 23. Graphic renderer for Apache Web Server busy worker monitoring
<bean id="Apache2-All-BusyWorkers-Line"
   class="org.runtimemonitoring.spring.rendering.GroovyRenderer">
   <property name="groovyRenderer">
      <bean class="org.runtimemonitoring.spring.groovy.GroovyScriptManager">
         <property name="sourceUrl" value="file:///groovy/rendering/multiLine.groovy"/>
      </bean>
   </property>
   <property name="dataCaches">
      <set>
         <ref bean="Apache2-AP01-Cache"/>
         <ref bean="Apache2-AP02-Cache"/>
      </set>
   </property>
   <property name="renderingProperties">
      <value>
        title=Apache Servers Busy Workers
        yAxisName=# of Workers Busy
      </value>
   </property>
   <property name="metricLocatorFilters" value=".*/Busy Workers"/>
</bean>

Renderers are fairly straightforward to implement, but I find that I constantly want to tweak them, so the Groovy approach outlined here works well to prototype a new chart type, or perhaps a new graphics package, quickly. Once the Groovy code compiles, the performance is good and with good content caching should not be an issue. The dynamic hot update and highly functional Groovy syntax make it easy to make updates on the fly. Later on, when I have figured out exactly what I want the renderer to do and what all the options that it should support are, I'll port them over to Java code.

Metric names are generated by the org.runtimemonitoring.tracing.Trace class. Each instance of this class represents one ITracer reading, so it encapsulates the value traced, the time stamp, and the full namespace. The name of the metric is the full namespace, including the metric name. In this case, the metric I am displaying is WebServers/Apache/Apache2-AP01/Busy Workers, so the filter defined in the renderer in Listing 23 zeroes in on this one metric for rendering. The generated image is shown in Figure 26:

Figure 26. Rendered Apache busy workers
Rendered Apache busy workers

Different clients may require differently rendered graphics. For instance, one client may require a smaller image, and images resized on the fly typically get blurred. Another client may require an image that is smaller still and may want to dispense with the title (and provide a title in its own UI). The MediaServlet allows additional options to be supplied with content requests. These options are appended to the content request's URL and are processed in REST format. The basic format is the media servlet path (this is configurable) followed by the renderer name, as in /media/Apache2-All-BusyWorkers-Line. Each renderer can support different options. For the renderer used above, the following URIs provide a good example of this:

  • Default URI: /media/Apache2-All-BusyWorkers-Line
  • Reduced to 300 X 300: /media/Apache2-All-BusyWorkers-Line/300/300
  • Reduced to 300 X 300 with minimal title and axis names: /media/Apache2-All-BusyWorkers-Line/300/300/BusyWorkers/Time/#Workers

Figure 27 shows two reduced pie charts with no title using the URI Apache2-AP02-WorkerStatus-Pie/150/150/ /:

Figure 27. Reduced images of Apache Server worker pools
Shrunk Images of Apache Server Worker Pools

Renderers can generate content in virtually any format that can be displayed by the client requesting it. Image formats can be JPG, PNG, or GIF. Other image formats are supportable, but for static images targeted for Web browser clients, PNG and GIF probably work best. Other options for formats are text-based using markup such as HTML. Browsers and rich clients can both render fragments of HTML, which can be ideal for displaying individual data fields and cross-tabular tables. Plain text can also be useful. For example, a Web browser client might retrieve text from a renderer that represents system-generated event messages and insert it into a text box or list box. Other types of markup are also highly adaptable. Many rich clients and client-side rendering packages for browsers read in XML documents defining graphics that can then be generated on the client side, which is optimal for performance.

Client-side rendering offers an additional opportunity for optimization. If a client can render its own visualizations, then it is possible to stream cache updates directly to the client, bypassing the renderers unless they are needed to add markup tags. In this way, a client can subscribe to cache update events and, on receipt of them, update its own visualizations. Streaming data to clients can be done in a number of ways. In browser clients, a simple Ajax-style poller can periodically check the server for updates and implement a handler that inserts any updates into the data structure handling the rendering in the browser. Other options that are slightly more complicated involve real streaming of data using the Comet pattern, whereby a connection to the server remains open at all times and data is read by the client as it is written by the server (see Related topics). For rich clients, a messaging system in which clients subscribe to data-update feeds is ideal. ActiveMQ can serve both kinds of client: in conjunction with the Jetty Web server and its Comet capabilities, it is possible to create a browser-based JavaScript JMS client that subscribes to queues and topics.

The rich rendering possible on the client side also adds capabilities not available with flat images, such as the ability to click on elements for drill-down — a common requirement in APM dashboards where drill-down is used to navigate or see specific items represented in charts in more detail. An example of this is Visifire, an open source charting tool that works with Silverlight (see Related topics). Listing 24 shows a fragment of XML that generates a bar chart showing CPU utilization across database servers:

Listing 24. Graphic renderer for database average CPU utilization
<vc:Chart xmlns:vc="clr-namespace:Visifire.Charts;assembly=Visifire.Charts">
   <vc:Title Text="Average CPU Utilization on Database Servers"/>
   <vc:AxisY Prefix="%" Title="Utilization"/>
   <vc:DataSeries Name="Utilization" RenderAs="Column">
      <vc:DataPoint AxisLabel="DB01" YValue="13"/>
      <vc:DataPoint AxisLabel="DB02" YValue="57"/>
      <vc:DataPoint AxisLabel="DB03" YValue="41"/>
      <vc:DataPoint AxisLabel="DB04" YValue="10"/>
      <vc:DataPoint AxisLabel="DB05" YValue="30"/>
   </vc:DataSeries>
</vc:Chart>

The XML is fairly trivial, so it is simple to create a renderer for it, and the presentation is quite nice. Client-side renderers can also animate visualizations, which for an APM system display is of variable value, but in some cases, may be helpful. Figure 28 shows the graph generated in a browser with the Silverlight client enabled:

Figure 28. A Visifire Silverlight rendered chart
A Visifire Silverlight rendered chart

The standard chart types all have a place in APM dashboards. The most common are line, multiline, bar, and pie charts. Often charts are combined, such as bar and lines to display overlaps of two different types of data. In other cases, a line chart has a double y-axis so that data series representing values of significantly different magnitudes can be represented on the same graph. This might be the case when plotting a percentage value against a scalar value such as % CPU Utilization on a router against the number of bytes transferred.

In some scenarios, specialized widgets can be created to represent data in a customized manner or because they present it more intuitively. For example, enumerated symbols display a specific icon in accordance with the status of the monitoring target. Because a status is usually represented by a limited number of potential values, charting is overkill, so something like a traffic-light display represents red for down, amber for warning, and green for okay. Another popular widget for this is the dial (often represented as a speedometer). I think dials are a waste of screen space, because they display only one vector of data with no history. A line graph shows the same data and the historical trend to boot. One exception is that multineedle dials can show ranges such as high/low/current. But for the most part, they're for visual appeal, like the ones in Figure 29 showing database block buffer gets per second with the high/low/current for the past hour:

Figure 29. Example dial widgets
Example dial widgets

In my view, the premium for visualization is on density of information. Screen space is limited, and I want to see as much data as possible. Data density can be achieved in several ways, but some of the more interesting ones are custom graphic representations that combine multiple dimensions of data in one small picture. A good example, shown in Figure 30, is from a (sadly) retired database-monitoring product:

Figure 30. Savant cache hit ratio display
Savant Cache Hit Ratio Display

Figure 30 displays several vectors of data revolving around database buffer hit ratios:

  • The horizontal axis represents the percentage of data found in cache.
  • The vertical axis represents the current hit ratio.
  • The X represents the trending hit ratio.
  • The gray circle (barely visible in this image) is the standard variation. The larger the diameter of the gray circle, the more variation there has been in cache performance.
  • The yellow ball represents cache performance over the last hour.

A second, more minimalist approach to data density is Sparklines. This term was coined by data visualization expert Edward Tufte (see Related topics) for "small, high resolution graphics embedded in a context of words, numbers, images." They are commonly used to display a large number of financial statistics. Although they lack context, their purpose is to display relative trends across many metrics. A Sparkline renderer is implemented by the org.runtimemonitoring.spring.rendering.SparkLineRenderer class, which uses the open source Sparklines for Java library (see Related topics). An example of two (magnified) Sparklines is illustrated in Figure 31, showing bar- and line-based displays:

Figure 31. Sparklines showing Apache 2 busy workers
Sparklines showing Apache 2 busy workers

The examples outlined here and in the attached code are fairly basic, but clearly an APM system requires highly advanced and detailed dashboards. Moreover, most users will not want to create dashboards from scratch. APM systems usually have some form of dashboard generator that allows a user to view or search a repository of available metrics and pick which ones to embed in the dashboard and in what format they should be displayed. Figure 32 displays a section of a dashboard I created with my APM system:

Figure 32. Dashboard example
Dashboard Example


This concludes my series. I have presented guidelines, some general performance-monitoring techniques, and specific development patterns you can implement to enhance or build your own monitoring system. Collecting and analyzing good data can significantly improve your application uptime and performance. I encourage developers to participate in the process of monitoring production applications: there is no better source of information to determine and experience what is really going on with the software you've written as it runs under load. This feedback is invaluable as part of an ongoing improvement cycle. Happy monitoring!


A big thanks to Sandeep Malhotra for his assistance with the Web service collectors.

Downloadable resources

Related topics

