Introduction to grid monitoring and collecting statistical data
In any environment, it is important to monitor the health of the WebSphere eXtreme Scale grid and collect statistical information about grid usage. These are a few simple strategies that can be used to monitor the overall health of the grid and generate operational alerts when necessary.
The CPU usage of catalog servers is minimal when catalog containers are in a steady state. A steady state for catalog containers includes:
- No containers being added or removed from the grid.
- No new WebSphere eXtreme Scale clients trying to connect to the WebSphere eXtreme Scale grid.
For this reason, CPU usage on the catalog servers has a lower monitoring priority. You can generate an operational alert on any user CPU spike of 85% or more on the catalog server machines.
The container servers, however, are the workhorses of any WebSphere eXtreme Scale grid, and their CPU usage needs to be closely monitored. Although there is no precise formula, you can generate a medium severity (yellow) operational alert when the user CPU in any WebSphere eXtreme Scale container goes beyond 65%. If the user CPU in a WebSphere eXtreme Scale container climbs to 85% or higher, a high severity (red) operational alert can be generated.
The CPU percentages in WebSphere eXtreme Scale servers can be monitored by nmon or other similar tools. For more information about nmon, see the "nmon performance: A free tool to analyze AIX and Linux performance" developerWorks article.
To avoid memory issues, a grid is conservatively sized based on the expected amount of data to be cached. However, even a conservative estimate can prove to be inadequate in real life. Sometimes, real-life data access patterns cannot be accurately predicted beforehand, and the user heap space in each of the container JVMs should be monitored.
Ideally, the heap usage in each container JVM is under 60%. A yellow alert should be generated if any container JVM user heap space goes beyond 60% but stays within 75%. User heap-space usage exceeding 75% on any container JVM should generate a red alert. However, there is no real need to monitor the heap usage on the catalog servers.
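The CPU and heap thresholds above can be expressed as a small classification step in a monitoring script. The following is a minimal sketch, not part of any WebSphere eXtreme Scale tooling: the function and enum names are hypothetical, and the script is assumed to have already obtained the percentages from nmon, verbose GC logs, or a similar source.

```python
from enum import Enum


class Severity(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"


def container_cpu_severity(user_cpu_pct: float) -> Severity:
    """Container user CPU: red at 85% or higher, yellow beyond 65%."""
    if user_cpu_pct >= 85.0:
        return Severity.RED
    if user_cpu_pct > 65.0:
        return Severity.YELLOW
    return Severity.GREEN


def container_heap_severity(heap_used_pct: float) -> Severity:
    """Container JVM heap: red beyond 75%, yellow beyond 60% up to 75%."""
    if heap_used_pct > 75.0:
        return Severity.RED
    if heap_used_pct > 60.0:
        return Severity.YELLOW
    return Severity.GREEN
```

Keeping the two rules in separate functions makes the slightly different boundary conditions (85% inclusive for CPU, 75% exclusive for heap) explicit and easy to adjust per environment.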
Also, verbose_gc files are generated on the container JVMs at convenient intervals to get a feel for the average heap space usage, garbage collection frequency, garbage collection pause time, and other relevant information.
Monitoring with the server-side flight recorder
The xsadmin utility for WebSphere eXtreme Scale (V7.1 and earlier) and the xscmd utility for WebSphere eXtreme Scale (V7.1 and later) can be used to collect important grid attributes and statistical data at run time. With either of these two utilities, you can develop and deploy simple scripts as server-side flight recorders for monitoring and logging relevant grid attributes at periodic intervals. Depending on the values of certain monitored grid attributes, the scripts can also generate operational alerts. The following grid attributes can be monitored by the server-side flight recorder in real-life WebSphere eXtreme Scale DynaCache environments.
Containers and shard distribution
The xscmd -c showPlacement command (or its equivalent xsadmin -containers command) outputs the distribution of shards among the live WebSphere eXtreme Scale container servers. As explained earlier in "Tips and techniques for WebSphere eXtreme Scale DynaCache in WebSphere Commerce environments, Part 2: WebSphere eXtreme Scale grid sizing and configuration," the number of containers and shards in any WebSphere eXtreme Scale grid is known beforehand. If the total number of containers in the command output is less than the expected number, the grid is working in an impaired condition. An alert should be generated for the administrator to investigate and restart the dead containers.
For grids with replicas, also watch for an approximately uniform distribution of primary and replica shards among the WebSphere eXtreme Scale containers. If the output of the commands shows a shard imbalance, the xscmd -c balanceShardTypes command (or a number of xsadmin -swapShardWithPrimary commands) can be issued to balance the grid.
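The two placement checks described above can be sketched as follows. This is an illustration under stated assumptions: the container names and per-container primary counts are assumed to have been parsed from the placement command output already (the parsing is not shown because the output format is not reproduced here), and the function names are hypothetical.

```python
from typing import Dict, List


def missing_containers(expected: List[str], live: List[str]) -> List[str]:
    """Return the expected container names that are absent from the live list."""
    live_set = set(live)
    return [name for name in expected if name not in live_set]


def primary_imbalance(primaries_per_container: Dict[str, int]) -> int:
    """Return the spread (max - min) of primary shard counts across containers.

    A spread of 0 or 1 is as even as the grid can get; what spread
    justifies rebalancing is a site-specific judgment call.
    """
    counts = list(primaries_per_container.values())
    return max(counts) - min(counts) if counts else 0
```

A flight recorder script might raise an alert whenever `missing_containers` returns a non-empty list, and log the spread on every run so that a gradual drift toward imbalance is visible over time.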
State of partitions
The xscmd -c routetable command (or its xsadmin equivalent) outputs the states of all the shards. The state of a shard can be either reachable or unreachable. Alerts can be generated if the number of partitions does not match the numberOfPartitions attribute that is specified in the grid deployment file, or if one or more shards remain unreachable for some time (half a minute or more). After an alert is generated for an unreachable shard, three Java™ cores are taken at one-minute intervals from the corresponding container JVM for later analysis, along with other system monitoring data.
NOTE: During internal state transitions, a shard can legitimately be in an unreachable state for a short period of time. It is rare to encounter a shard that remains permanently in an unreachable state. However, if you encounter such a shard, collect all the diagnostic information. Then restart the owning WebSphere eXtreme Scale containers to attempt to resolve the situation. If a container restart fails to address the issue, you will have to restart all the containers.
In some environments, customers might decide to turn on the quorum for catalog servers. This prevents the splitting of the WebSphere eXtreme Scale grid into multiple independent grids. For more information about communication glitches between catalog servers, refer to the "Catalog server quorums" topic in the WebSphere eXtreme Scale information center (see Resources).
If the quorum flag is set, monitor the output of the showQuorumStatus command (or the xsadmin -quorumStatus command). If the quorum is broken, the active server count in the command output is less than the number of catalog server cluster members. If the quorum remains broken for a long time (5 minutes), an alert should be generated for the administrator to take necessary actions after careful analysis.
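Both the unreachable-shard rule (half a minute or more) and the broken-quorum rule (5 minutes) alert only when a condition persists across several samples, so that short-lived transitions are ignored. A minimal sketch of that pattern follows. The class name and the string keys are hypothetical, and the caller is assumed to supply the set of currently failing items from the command output on each sampling run.

```python
from typing import Dict, List, Set


class SustainedConditionTracker:
    """Report items that have been failing continuously for a threshold time."""

    def __init__(self, threshold_seconds: float) -> None:
        self.threshold_seconds = threshold_seconds
        self._first_seen: Dict[str, float] = {}

    def update(self, failing: Set[str], now: float) -> List[str]:
        """Record one sample and return the items that should raise an alert."""
        # Forget items that recovered since the previous sample.
        for key in list(self._first_seen):
            if key not in failing:
                del self._first_seen[key]
        # Remember when each failing item was first observed.
        for key in failing:
            self._first_seen.setdefault(key, now)
        return sorted(
            key
            for key, since in self._first_seen.items()
            if now - since >= self.threshold_seconds
        )
```

For example, a tracker created with `SustainedConditionTracker(30)` could be fed shard identifiers, and one created with `SustainedConditionTracker(300)` could be fed a single key such as `"quorum"`. Note that an item keeps being reported on every sample until it recovers, so the calling script decides whether to alert once or repeatedly.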
The showMapSizes command (or its xsadmin equivalent, the mapsizes command) outputs important cache statistical information. This information includes the number of elements that are cached in each partition, in each container, and in the entire grid, along with the amount of heap space that is occupied by these cached entities. The space calculation includes most of the overhead for storing the cached entities in the WebSphere eXtreme Scale grid. By monitoring the total number of cached entities and their combined size over time, users can obtain valuable information about important operational and sizing attributes, such as cache growth rate over time, peak cache growth period, cache growth during special promotion periods, and average object size. Based on this information, e-commerce application architects can fine-tune the application caching strategies.
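The derived attributes mentioned above (average object size, growth rate, and peak growth period) follow directly from periodic samples of the total entry count and used bytes. The following sketch assumes that a flight recorder has already extracted those two totals from each run of the map sizes command; the `MapSizeSample` type and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MapSizeSample:
    epoch_seconds: float  # when the sample was taken
    entry_count: int      # total cached entries in the grid
    used_bytes: int       # total heap bytes reported for those entries


def average_object_size(sample: MapSizeSample) -> float:
    """Average bytes per cached entry, or 0.0 for an empty cache."""
    if sample.entry_count == 0:
        return 0.0
    return sample.used_bytes / sample.entry_count


def growth_per_hour(earlier: MapSizeSample, later: MapSizeSample) -> float:
    """Cache growth in entries per hour between two samples."""
    elapsed_hours = (later.epoch_seconds - earlier.epoch_seconds) / 3600.0
    if elapsed_hours <= 0:
        raise ValueError("samples must be in increasing time order")
    return (later.entry_count - earlier.entry_count) / elapsed_hours


def peak_growth_interval(
    samples: List[MapSizeSample],
) -> Tuple[MapSizeSample, MapSizeSample]:
    """Return the consecutive sample pair with the highest growth rate."""
    if len(samples) < 2:
        raise ValueError("at least two samples are required")
    pairs = list(zip(samples, samples[1:]))
    return max(pairs, key=lambda pair: growth_per_hour(pair[0], pair[1]))
```

With samples taken every 15 minutes, `peak_growth_interval` identifies the quarter hour in which the cache grew fastest, which is one way to locate the peak cache growth period in the recorded data.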
WebSphere eXtreme Scale exposes a number of MBeans, and with any available JMX-compliant monitoring tool, you can implement a server-side flight recorder for monitoring, statistics collection, and alert generation. For more information, refer to the "Package com.ibm.websphere.objectgrid.management" topic in the WebSphere eXtreme Scale information center documentation (see Resources).
Monitoring with the client-side flight recorder
Traditional DynaCache can be viewed, monitored, and managed from the GUI of the Cache Monitor utility. For WebSphere eXtreme Scale DynaCache, you might use a similar utility, or a special version of the Extended Cache Monitor (ECM) JEE application. Refer to the "IBM Extended Cache Monitor for IBM WebSphere Application Server technology preview" developerWorks article for utility details.
You must not use a wildcard query for WebSphere eXtreme Scale DynaCache using the ECM. Such a query might require a restart of the JVM from where the query was issued. For this reason, it is recommended that you update the cache monitor with the cachemonitor7_package_for_WXS85.zip file from the IBM Extended Cache Monitor web site (see Resources).
NOTE: The name of the zip file is misleading. The update works on any WebSphere eXtreme Scale version, not just WebSphere eXtreme Scale 8.5.
All the update instructions that are detailed in the "IBM Extended Cache Monitor for IBM WebSphere Application Server technology preview" developerWorks article must be properly followed to create a safe version of the ECM that does not issue a wildcard query.
The ECM can be installed in the entire commerce cluster. As a good operational practice, the ECM is used only from one or two designated JVMs in the cluster. With ECM, you can look at the servlet cache instances. To look at the servlet cache instances, you have to add the JVM custom property com.ibm.ws.cache.CacheConfig.showObjectContents and set its value to true in the JVMs that are used to execute the ECM application.
The WebSphere eXtreme Scale console server (V8.5 or later) provides a powerful facility to query the keys of the cached objects using regular expressions. Based on the results of the query, you can individually invalidate one or more items in the cache. For details, refer to the "Querying and invalidating data" topic in the WebSphere eXtreme Scale information center (see Resources).
The ECM in WebSphere eXtreme Scale DynaCache environments is used to track the rate of least recently used (LRU) evictions from DynaCache. Ideally, the LRU eviction rate is zero or very small. A consistently high LRU eviction rate can indicate a poorly sized WebSphere eXtreme Scale grid, a poorly designed cache strategy, or both. Based on the LRU eviction rate and the server-side statistics for the number and size of cached entities, you might have to increase the grid capacity, fine-tune the application caching strategy, or both.
The ECM displays the LRU evictions on its GUI. To determine the rate of LRU eviction, you have to watch the LRU evictions field in the ECM GUI. As an alternative, you can deploy a client-side flight recorder to periodically collect DynaCache MBean statistics and place them in a CSV formatted file. DynaCacheStatisticsCSV.py, the main flight recorder code, can be obtained from the GitHub web site (see Resources). For sample usage of this flight recorder script, refer to the "All Things WebSphere" blog and the prologue of DynaCacheStatisticsCSV.py (see Resources). The client-side flight recorder data is useful in debugging and tuning client applications using DynaCache.
Both the client-side and the server-side flight recorder are used in load test environments and also in production. In a production environment, initially they are executed every 15 minutes. Later as the system stabilizes, they might be executed at a much longer interval, say every half an hour or 1 hour.
Using multi-data-center environments
The use of multiple data centers in active-passive mode for disaster recovery is fairly common in enterprise IT environments. WebSphere Commerce is used in multi-data-center installations, and is often in active-active mode sharing a backend database. Each data center typically contains one WebSphere Commerce cell. The WebSphere Commerce application that is deployed in the cells of all the data centers handles user traffic. The traffic to each data center is workload-managed by a standard IP sprayer.
For such a topology, it is recommended that you create multiple independent WebSphere eXtreme Scale DynaCache grids, one in each data center. Try not to create a single WebSphere eXtreme Scale DynaCache grid spanning multiple data centers. Also, for ease of configuration and maintenance you can create identical grids of the same name in each data center.
WebSphere Commerce provides a wide range of cache invalidation approaches. For more detail on these approaches, refer to the "Cache invalidation" topic in the WebSphere Commerce information center (see Resources).
In the context of a WebSphere Commerce solution with WebSphere eXtreme Scale, we typically use:
- Command-based invalidation - This type of invalidation is triggered when a specific business-logic command executes in the WebSphere Application Server container and a cached entity is associated with it by the invalidation policies defined in the cache rule file (cachespec.xml). Such invalidations are typically triggered by a guest action on the web site, or by a business user changing a business rule in the WebSphere Commerce staging environment (for example, using the Commerce Management Center business interface). For details on the staging environment, refer to the "Staging server" topic in the WebSphere Commerce information center (see Resources).
- Scheduled invalidation - This type of invalidation is used when the data state changes in the WebSphere Commerce database. This is the preferred approach for merchant data changes on the web site, and it follows a pattern that is similar to a publish-subscribe design pattern. The data change and the corresponding dependencyId of the cache entry to be invalidated are published to the WebSphere Commerce CACHEIVL database table, where the subscriber to this event runs periodic checks and performs the appropriate invalidation in the cache provider.
The key idea behind this approach is that the cached content in the cache provider (WebSphere eXtreme Scale) can be uniquely identified by dependency IDs (referred to as DIDs from here onwards). When the same ID is entered in the CACHEIVL table, the WebSphere Commerce invalidation job picks it up and performs the invalidation in WebSphere eXtreme Scale. This cache invalidation scheduler command typically runs every 10 minutes, reads all the records from the CACHEIVL table, and makes the DynaCache invalidation API call with the specified DIDs. It is highly recommended that you create a specific commerce instance and cluster for performing scheduler-based invalidations.
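The flow of one scheduled invalidation pass can be modeled in a few lines. The sketch below is only a conceptual model written in Python; the actual job is the WebSphere Commerce scheduler command, not this code. The two callables are hypothetical stand-ins: `read_pending_dids` for the query against the CACHEIVL table, and `invalidate` for the DynaCache invalidation API call.

```python
from typing import Callable, Iterable, List


def run_invalidation_pass(
    read_pending_dids: Callable[[], Iterable[str]],
    invalidate: Callable[[str], None],
) -> List[str]:
    """Run one pass: read the pending DIDs and invalidate each one once."""
    seen = set()
    invalidated: List[str] = []
    for did in read_pending_dids():
        if did in seen:
            # The same DID queued twice needs only one invalidation call.
            continue
        seen.add(did)
        invalidate(did)
        invalidated.append(did)
    return invalidated
```

A scheduler would call this function at a fixed interval (every 10 minutes in the description above). Returning the list of invalidated DIDs gives a flight recorder something to log for each pass.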
Figure 1. Invalidation process flow
WebSphere Commerce cache invalidation recommendations
A fine-grained invalidation approach is recommended because it evicts only the dirty cache entity and allows the other, unrelated cache entities to remain in the cache. The solution determines the affected cache-entity dependencyIDs (DIDs) and uses them to perform precise invalidation of the cached content, rather than clearing all the cached entities. An invalidation event is neither idempotent nor cheap, so you need to ensure that only one invalidation event occurs for a given dependencyID during a transaction. You also need to implement the cache invalidation process based on the dynamic cache invalidation approach of the WebSphere Commerce scheduler framework. This framework invalidates specific cache entities by comparing the DATAID value in the CACHEIVL database table to the dependencyID value of the cache entity. For more information on the cache invalidation framework in WebSphere Commerce, refer to the "DynaCacheInvalidation URL" topic in the WebSphere Commerce information center (see Resources).
Command invalidation is implemented with one of the following options to avoid duplicate invalidation messages.
- Mention only one command in the cachespec.xml for issuing the invalidation call per unique transaction.
- Modify the WebSphere Commerce Command class to issue the programmatic invalidation when the specific user transaction is encountered. This is best implemented by extending the processExecute method of the concerned command (for example: PromotionEngineOrderCalculateCmdImpl) and then issuing the invalidation towards the end of the method body. If you use this approach, you need to set the delayed invalidation attribute to true in the invalidation API.
In Listing 1 and Listing 2, the system is expected to update the mini-cart total based on the user action. In the correct approach of Listing 2, only one command is mentioned in cachespec.xml that issues the invalidation in all the possible user interactions. This one command for all user interactions avoids unnecessary invalidation traffic.
Listing 1. Incorrect command invalidation approach
<!-- ******************************************************* -->
<!-- Start Invalidation rules for Shopping Totals.           -->
<!-- ******************************************************* -->
<!-- We add this rule so the Shopping Totals will always be  -->
<!-- correct when users modify their cart. This is a         -->
<!-- invalidation policy that needs to be invalidated based  -->
<!-- on a shoppers action.                                   -->
<!-- ******************************************************* -->
<cache-entry>
    <class>command</class>
    <sharing-policy>not-shared</sharing-policy>
    <name>com.ibm.commerce.order.commands.OrderCalculateCmdImpl</name>
    <name>com.ibm.commerce.order.commands.PromotionEngineOrderCalculateCmdImpl</name>
    <name>com.ibm.commerce.orderitems.commands.OrderItemMoveCmdImpl</name>
    <name>com.ibm.commerce.usermanagement.commands.UserRegistrationAddCmdImpl</name>
    <name>com.ibm.commerce.usermanagement.commands.UserRegistrationUpdateCmdImpl</name>
    <name>com.ibm.commerce.order.commands.OrderProcessCmdImpl</name>
    <name>com.ibm.commerce.orderitems.commands.OrderItemAddCmdImpl</name>
    <name>com.ibm.commerce.orderitems.commands.OrderItemDeleteCmdImpl</name>
    <name>com.ibm.commerce.orderitems.commands.OrderItemUpdateCmdImpl</name>
    <name>com.ibm.commerce.orderitems.commands.ExternalOrderItemAddCmdImpl</name>
    <name>com.ibm.commerce.orderitems.commands.ExternalOrderItemUpdateCmdImpl</name>
    <name>com.ibm.commerce.order.commands.OrderCancelCmdImpl</name>
    <name>com.ibm.commerce.order.commands.SetPendingOrderCmdImpl</name>
    <invalidation>MiniCart:DC_storeId:DC_userId
        <component type="method" id="getCommandContext">
            <method>getStoreId</method>
            <required>true</required>
        </component>
        <component type="method" id="getCommandContext">
            <method>getUserId</method>
            <required>true</required>
        </component>
    </invalidation>
</cache-entry>
Listing 2. Correct command invalidation approach
<!-- ******************************************************* -->
<!-- Start Invalidation rules for Shopping Totals.           -->
<!-- ******************************************************* -->
<!-- We add this rule so the Shopping Totals will always be  -->
<!-- correct when users modify their cart. This is a         -->
<!-- invalidation policy that needs to be invalidated based  -->
<!-- on a shoppers action.                                   -->
<!-- ******************************************************* -->
<cache-entry>
    <class>command</class>
    <sharing-policy>not-shared</sharing-policy>
    <name>com.ibm.commerce.order.commands.PromotionEngineOrderCalculateCmdImpl</name>
    <invalidation>MiniCart:DC_storeId:DC_userId
        <component type="method" id="getCommandContext">
            <method>getStoreId</method>
            <required>true</required>
        </component>
        <component type="method" id="getCommandContext">
            <method>getUserId</method>
            <required>true</required>
        </component>
    </invalidation>
</cache-entry>
Similarly, when you use the scheduled invalidation approach with the CACHEIVL table, you need to avoid duplicate dependencyIDs between the last successful run and the next scheduled run of the invalidation job.
One of the biggest challenges in the scheduled invalidation implementation is to detect the change event without intrusive performance implications, and then generate the dependencyID of the cache entities that are affected by the change. Another important challenge is that the software that makes the merchant data-change (like catalog, price, and inventory) typically does not allow programmatic hooks for making invalidation API calls. The best invalidation strategy is to implement dependency-based invalidation. This option has the advantage of being able to capture change events with less performance impact. You implement dependency-based invalidation with the following process:
- Develop a program to generate dependencyID of all the cache entities, which depend on a specific database table and its data.
- Customize the data load process to make it intelligent enough to derive the dependencyID based on the data that is modified in the database, and then insert the dependencyID of the cache entry to be invalidated in the CACHEIVL table.
- Stage-Prop the invalidation dependencyIDs from the staging server. For details on the Stage-Prop process, see Resources.
- Create selective database triggers in the production database to invalidate data that is updated directly in the production system, for example through an emergency update process.
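The data load customization in the steps above has two jobs: derive a dependencyID from each changed record, and avoid queuing a dependencyID that is already waiting for the next invalidation run. A minimal sketch follows. The `productId:<id>` format and the function name are hypothetical; a real implementation must produce exactly the dependencyIDs that the cachespec.xml rules generate, and the pending list would come from the CACHEIVL table.

```python
from typing import Iterable, List


def dids_to_queue(
    changed_product_ids: Iterable[int],
    pending_dids: Iterable[str],
    did_prefix: str = "productId",  # hypothetical DID naming scheme
) -> List[str]:
    """Derive DIDs for changed products, skipping ones already pending."""
    pending = set(pending_dids)
    to_insert: List[str] = []
    for product_id in changed_product_ids:
        did = f"{did_prefix}:{product_id}"
        if did in pending:
            # Already queued since the last successful invalidation run.
            continue
        pending.add(did)
        to_insert.append(did)
    return to_insert
```

Filtering on the producer side keeps duplicate dependencyIDs out of the table between two runs of the invalidation job, rather than relying on the job to discard them later.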
Set the timeout and inactivity elements of cachespec.xml to a large value. They should not be used as a mechanism for implementing normal cache invalidation rules. Instead, implement time-based invalidation by creating a separate entry in the SCHCONFIG table for the DynaCacheInvalidation command and configuring it to execute at a specific time of the day.
Based on our production implementation experiences, it is also highly recommended that you avoid using the ActivityCleanUpCmd scheduled job, and instead use the WebSphere Commerce database object purge utility (dbclean) to purge business-context data from the database. The challenges of using ActivityCleanUpCmd and the approach for using the DBClean utility are discussed on the IBM Support Portal (see Resources).
This final installment of the series provides ideas, insights, and best practices for successfully monitoring, collecting, and invalidating data on a grid for WebSphere eXtreme Scale DynaCache. Although the series focuses on WebSphere Commerce environments, most of the techniques are equally applicable to the general domain of WebSphere eXtreme Scale DynaCache. WebSphere eXtreme Scale is a quickly evolving technology. As additional best practices and tools emerge, the authors will provide updates to the WebSphere eXtreme Scale DynaCache usage series.
The authors are grateful to the following people for technical discussions over the course of the work with WebSphere Commerce and WebSphere eXtreme Scale: Kyle Brown, IBM DE, ISSW; Brian Martin, STSM, WebSphere eXtreme Scale and XC10 Lead Architect; Douglas Berg, WebSphere eXtreme Scale Architect; Chris Johnson, WebSphere eXtreme Scale Architect; Jared Anderson, WebSphere eXtreme Scale Architect; Rohit Kelapure, WebSphere Application Server Development; Joseph Mayo, XC10 Development; Surya Duggirala, WebSphere Application Server Performance Lead; Matt Kilner, JDK L3; Brian Thomson, STSM, WebSphere Commerce Server CTO; Misha Genkin, WebSphere Commerce Server Performance Architect; Robert Dunn, WebSphere Commerce Server Development; Kevin Yu, ISS-IS. The authors would like to acknowledge Mary A. Brooks for a superb job in copy editing. A very special thanks to Cheenar Banerjee for her assistance in proofreading and suggesting several readability improvements.
Resources
- Read more about "IBM Extended Cache Monitor for IBM WebSphere Application Server technology preview" (developerWorks, May 2007).
- Read more about "nmon performance: A free tool to analyze AIX and Linux performance" (developerWorks, Nov 2003).
- Read about "WebSphere eXtreme Scale" in the WebSphere eXtreme Scale information center documentation.
- Read more about "Catalog server quorums" in the WebSphere eXtreme Scale information center documentation.
- Read more about "Package com.ibm.websphere.objectgrid.management" in the WebSphere eXtreme Scale information center documentation.
- Read more about "Querying and invalidating data" in the WebSphere eXtreme Scale information center documentation.
- Find the dynacache/scripts/DynaCacheStatisticsCSV.py script on the GitHub web site.
- Read more on the "All Things WebSphere" blog.
- Read more about "JR38217:CMVC 206127: Address Performance Problems Caused by the ActivityCleanUp job issuing explicit DynaCache invalidations" (IBM Support Portal, Feb 2011).
- Read more about "Maintaining the Business Context tables using DBClean" (IBM Support Portal, Nov 2010).
- Read about "WebSphere Commerce" in the WebSphere Commerce information center documentation.
- Read more about "Cache invalidation" in the WebSphere Commerce information center documentation.
- Read more about "Staging server" in the WebSphere Commerce information center documentation.
- Read more about "DynaCacheInvalidation URL" in the WebSphere Commerce information center documentation.
Get products and technologies
- Download IBM Extended Cache Monitor.