To maintain the overall health of any enterprise computing system, you must monitor it. IBM® WebSphere® eXtreme Scale is no different in this respect. Whether you have WebSphere eXtreme Scale deployed into an IBM WebSphere Application Server-managed JVM (Java™ Virtual Machine) or into a standalone J2SE JVM, you already have some solid monitoring options available to you.
For example, if your WebSphere eXtreme Scale is deployed in a WebSphere Application Server-managed environment, you can leverage WebSphere Application Server's Performance Monitoring Infrastructure (PMI) to gather and analyze metrics collected from the different parts of the WebSphere eXtreme Scale environment.
Regardless of the style of deployment you have or the monitoring tools you use, this article points out some of the key performance indicators (KPIs) that you should be monitoring on your WebSphere eXtreme Scale environment.
First, though, for the information presented here to be useful to you, here are a few quick assumptions and explanations:
- You should have working knowledge of WebSphere eXtreme Scale administration and configuration, and of applicable monitoring solutions, such as the IBM Tivoli® Performance Monitor (in WebSphere Application Server) or Wily Introscope. I bring this up only because the focus here is on the metrics to be monitored and what you should look for, not how to configure your environment to capture the metrics themselves.
- The groups of metrics listed below start at the lowest level of the software stack (the JVM) and culminate with the xsadmin tool at the WebSphere eXtreme Scale product level. This is because, in troubleshooting, it is often necessary to ascertain the health of the lower levels of the software stack and move upwards, especially when working with performance-related issues.
- Finally, the key metrics I point out here are based on my experiences with WebSphere eXtreme Scale and the customers I see who use the product. In other words, your mileage may vary.
I believe the KPIs for WebSphere eXtreme Scale fall into three categories, which represent the critical path elements that determine whether a WebSphere eXtreme Scale environment is healthy and performing optimally:
- Heap utilization
It should be pretty clear why heap space is important to your WebSphere eXtreme Scale environment; all the data that resides in your containers must fit within the confines of the Java heap. If you run out of available heap space, the JVM will garbage collect (sometimes excessively) in order to free up enough space to allocate the objects being requested by your application. If the JVM is unable to clear out enough space for an object allocation to occur, it will throw an OutOfMemoryError to indicate just that: there is no more available memory from which to allocate objects.
Noting the current heap utilization over time will give you insight into how much data you're storing in your container JVMs, and knowing how memory usage is trending can help you make good capacity planning decisions (that is, adding more JVMs or raising the maximum heap size for a JVM).
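If you want to sample heap utilization yourself rather than rely on a monitoring product, the standard java.lang.management API can report it from inside any JVM, including a container JVM. The class and method names below are illustrative, not part of WebSphere eXtreme Scale:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapCheck {
    // Returns current heap utilization as a percentage of the maximum heap.
    // If the maximum is undefined (getMax() returns -1), fall back to the
    // currently committed heap so the percentage is still meaningful.
    static double heapUtilizationPercent() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax() > 0 ? heap.getMax() : heap.getCommitted();
        return 100.0 * heap.getUsed() / max;
    }

    public static void main(String[] args) {
        System.out.printf("Heap used: %.1f%%%n", heapUtilizationPercent());
    }
}
```

Sampling this value periodically and recording it gives you the trend data described above.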
- Garbage collection frequency
When a JVM is performing garbage collection, it is generally not doing anything else (though different algorithms behave differently). If a JVM has to spend more time cleaning up memory than executing application code, performance will suffer. By taking the frequency of garbage collection cycles and measuring how much time per minute is spent in garbage collection operations versus executing application code, you can derive what is known as garbage collection overhead. Most enterprise monitoring tools will calculate this for you, but here’s the formula in case you want to figure it out yourself:
GC overhead (%) = (time paused in GC operations / elapsed time in the period of interest) × 100
As a general rule of thumb, if your garbage collection overhead percentage edges upwards of 10-15%, then you probably have a "sick" JVM on your hands. Depending on your response time requirements, you might not be able to tolerate a JVM running at 12% garbage collection overhead. At any rate, determine what your percentage should look like under normal load early in the lifetime of the environment, and then look for anomalies under different load patterns. This metric is valuable when observed in parallel with garbage collection pause times, described next.
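The formula above amounts to a one-line calculation; the helper below (class and method names are my own, purely illustrative) shows it applied to the rule-of-thumb threshold just discussed:

```java
public class GcOverhead {
    // GC overhead: the fraction of an observation window spent paused in
    // garbage collection, expressed as a percentage.
    static double gcOverheadPercent(long gcPauseMillis, long windowMillis) {
        return 100.0 * gcPauseMillis / windowMillis;
    }

    public static void main(String[] args) {
        // Example: 7.2 seconds paused in GC during a 60-second window yields
        // 12% overhead, which is above the 10% rule-of-thumb threshold.
        System.out.println(gcOverheadPercent(7200, 60000)); // 12.0
    }
}
```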
- Garbage collection pause times
In terms of garbage collection, pause time is defined as the amount of time (usually measured in milliseconds) during which the JVM was performing garbage collection operations and the threads (and heap) were locked, so no application code was being executed. Obviously, if application code is not running, then nothing useful is going on in the JVM (for example, in a container JVM, no data is being fetched, no replication is taking place, and so on). While garbage collection must be tolerated, arduously long pause times are another matter entirely. In most cases, long pause times can be attributed to things like CPU contention, excessive object allocation into an undersized heap, excessive load, and concurrent collection failures (in concurrent garbage collection algorithms). It is important to monitor this metric to help answer questions like, "Are my WebSphere eXtreme Scale containers contributing to long response times that are throwing me out of my SLA?" or "Is the performance degradation on the client side or the grid side?"
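Both garbage collection frequency and cumulative collection time can be read from any running JVM through the standard GarbageCollectorMXBean interface; verbose GC logs or a monitoring product will give you per-pause detail, but this is a quick way to spot-check a container JVM (the class name here is illustrative):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStats {
    public static void main(String[] args) {
        // Each registered collector (for example, the young and old generation
        // collectors) reports its cumulative collection count and the total
        // time spent in collections since JVM start. Sampling these two
        // numbers at intervals yields both frequency and time-in-GC.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```

Note that getCollectionCount() and getCollectionTime() may return -1 if the underlying collector does not support the metric.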
Next, we'll look at some WebSphere eXtreme Scale-specific metrics that can be used to determine just how your grid clients and containers are faring.
If you’re using a profiling tool, WebSphere eXtreme Scale provides some helpful metrics that can be used to gain insight into the operations going on within your data grid environment. A few of the key things to watch include:
The HAControllerImpl group of metrics handles core group life cycle and feedback events. You can monitor this class to get an indication of the core group structure and changes. By monitoring the number of times the viewChangeCompleted method is called, you can get an idea of when your WebSphere eXtreme Scale environment is changing state across the grid. For example, starting and stopping grid JVMs will trigger view changes as the servers acknowledge the new or missing member and update their view of the available grid JVMs. One way this metric can be used to spot trouble is when you see responses from this method indicating that view changes are occurring, and yet you are not making any changes to the state of the grid (such as starting or stopping grid JVMs). This situation might indicate that some JVMs have become unhealthy (for example, OutOfMemoryError), have crashed, or are being affected by a network partitioning event.
The ORBClientCoreMessageHandler group of metrics is responsible for sending application requests to the containers. You can monitor the sendMessage method for client response time and number of requests. This method (and the next one listed below) is a great indicator for determining the amount of time being spent doing grid operations (such as CRUD interactions, queries, and so on) in your grid environment. By measuring the time taken executing this method, you will know how long a round-trip call into the grid is taking. The time reported is inclusive of network latency between the client and container JVMs plus the time taken to process the request on the container end.
The ShardImpl.processMessage method is responsible for processing incoming client requests. In a similar vein to the previous metric, you can use the execution times for this method to determine how much time is being spent working on client requests inside the container JVM. With this method, you can get server-side response time and request counts. By correlating this metric with others, such as heap utilization, you can determine how balanced the workload of the grid actually is.
If you chose to augment an existing WebSphere Application Server environment with WebSphere eXtreme Scale, you have the option of using the WebSphere Application Server PMI to collect and display performance metrics. These metrics can be viewed with the Tivoli Performance Viewer tool via the WebSphere Application Server Integrated Solutions Console (or with a generic JMX client like JConsole). The PMI metrics can reveal both operation response times and operational/systemic errors.
WebSphere eXtreme Scale supplies a set of PMI modules that can provide this feedback:
- objectGridModule: Measures response time for each ObjectGrid instance.
- mapModule: Determines the size of each ObjectMap.
- queryModule: Measures the execution time of your OGQL queries.
- agentModule: Determines how well your agents are performing.
- hashIndexModule: Measures the effectiveness of your query indices.
Using just the PMI metrics, you can easily monitor the health of some of the key activities in your grid environment, such as average transaction response times, batch loader response time, agent duration time, and so on.
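Because PMI statistics are surfaced through JMX, any JMX client can read them programmatically. The exact WebSphere ObjectNames depend on your cell, node, and server configuration, so the sketch below queries the local platform MBeanServer and the standard java.lang:type=Memory MBean as a stand-in to show the query pattern (the class name is illustrative):

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class JmxQuery {
    public static void main(String[] args) throws Exception {
        // Illustrative only: read one attribute from the local platform
        // MBeanServer. A real PMI client would instead connect to the
        // application server's JMX endpoint and query the PMI ObjectNames
        // specific to your WebSphere configuration.
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName memory = new ObjectName("java.lang:type=Memory");
        Object heap = server.getAttribute(memory, "HeapMemoryUsage");
        System.out.println("HeapMemoryUsage = " + heap);
    }
}
```

The same getAttribute() pattern applies once you have located the PMI MBeans for the modules listed above.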
As a troubleshooting and monitoring aid, you might be tempted to overlook the xsadmin tool. However, xsadmin can provide key pieces of information that, when combined with other monitoring metrics, can yield a much clearer picture of the scenario at hand. For example, if response time metrics and garbage collection activity for a certain container JVM are showing signs of distress, a quick check of the container JVMs using xsadmin might help to explain why; for example, you might learn that other containers failed, which led to more shards moving to this JVM.
Some key information that xsadmin can yield include:
- Map row count.
- Container placement and client reachability.
- Shard placement across the available containers.
Whereas monitoring provides health and performance-related data, xsadmin adds to this by contributing topology data, again helping to clarify symptoms seen in other metrics.
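As a rough illustration of how you might gather this topology data, here are the kinds of invocations the sample xsadmin client supports. The option names and grid/map set names below are assumptions based on my recollection; verify them against the -help output of the xsadmin that ships with your version:

```
# List the containers known to the catalog service
xsadmin -containers

# Show the number of rows in each map, per partition
xsadmin -mapsizes -g MyGrid -m MyMapSet

# Show primary shard placement across the available containers
xsadmin -primaries -g MyGrid -m MyMapSet
```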
There absolutely are more metrics to be monitored than the KPIs discussed here. Every user and environment will likely need a mix of the metrics discussed in this article combined with others. In fact, the most valuable KPIs not listed here are the ones that you create yourself for your specific environment. The elements spotlighted here are KPIs for monitoring the system your applications are running on, but building your own statistics using your knowledge of the applications’ data flow could be even more valuable in terms of understanding why the system is behaving the way it is.
WebSphere eXtreme Scale Java API documentation
WebSphere eXtreme Scale Information Center
Redbook: User Guide to WebSphere eXtreme Scale
Redbook: IBM WebSphere eXtreme Scale V7: Solutions Architecture
IBM developerWorks WebSphere
John Pape currently works with the WebSphere SWAT Team and focuses on crit-sit support for clients who utilize WebSphere Application Server, WebSphere Portal Server, and WebSphere Extended Deployment. This role requires attention to detail as well as maintaining a “think-out-of-the-box” innovative mindset, all the while assuring IBM customers get the best support possible! John also has a development background in Domino/Lotus Notes, Microsoft C#, Java, C++, Perl, and Python. John is a social media champion within IBM and an avid Linux user.