Tips and techniques for WebSphere eXtreme Scale DynaCache in WebSphere Commerce environments, Part 3: WebSphere eXtreme Scale grid monitoring and collecting statistical data

This final installment of the series focuses on the grid monitoring best practices for the integration of IBM® WebSphere® eXtreme Scale DynaCache in WebSphere Commerce Server environments. WebSphere eXtreme Scale is a distributed caching solution that is a popular provider of DynaCache in large WebSphere Commerce Server environments. WebSphere Commerce Server customers have successfully integrated WebSphere eXtreme Scale DynaCache in large and small production environments. While the configuration of WebSphere eXtreme Scale DynaCache in the WebSphere Commerce Server environment is simple, you need to pay special attention to the best practices for design, usage, operational patterns, and tuning.

Dr. Debasish Banerjee (debasish@us.ibm.com), WebSphere Consultant, IBM

Dr. Debasish Banerjee is presently a WebSphere consultant in IBM Software Services. He started his WebSphere career as the WebSphere internationalization architect. Extreme transaction processing, distributed caching, elastic computing, and cloud computing are his current areas of interest. Debasish received his Ph.D. in the field of combinator-based functional programming languages.



Ravi Tripathi (ravi.tripathi@us.ibm.com), Managing Consultant, IBM

Ravi Tripathi is an IBM Managing Consultant working on the Smarter Commerce platform in IBM Software Services for Industry Solutions (ISS-IS). In this role, Ravi advises large retail corporations about Smarter Commerce implementation architecture, design, infrastructure, performance, and launch. He is an expert in designing and developing omni-channel, high-performing Smarter Commerce solutions for large retailers in North America. Ravi received his Master's degree in Production Engineering.



Jim Krueger (jim_krueger@us.ibm.com), Advisory Software Engineer, IBM

Jim Krueger is an IBM Advisory Software Engineer working on the WebSphere eXtreme Scale development team. He is the lead WebSphere eXtreme Scale Dynamic Cache developer. In this role, he frequently advises large corporations about WebSphere eXtreme Scale technology. Before joining WebSphere eXtreme Scale, Jim was a member of the WebSphere Application Server EJBContainer development team.



Anupam Basu (anupam@us.ibm.com), IT Architect, IBM

Anupam Basu is a certified IT architect with IBM Software Group and helps customers implement e-commerce solutions with IBM Smarter Commerce and the IBM Middleware portfolio of products. He has extensive experience in designing and developing enterprise architectures. He helped build several high-volume and high-performing Smarter Commerce solutions for large retailers in North America. He earned a double Master's degree in Statistics and Computer Science from the Indian Statistical Institute, Calcutta, India.



22 January 2013

Introduction to grid monitoring and collecting statistical data

In any environment, it is important to monitor the health of the WebSphere eXtreme Scale grid and collect statistical information about grid usage. The following sections describe a few simple strategies that you can use to monitor the overall health of the grid and generate operational alerts when necessary.

CPU usage

The CPU usage of catalog servers is minimal when catalog containers are in a steady state. A steady state for catalog containers includes:

  • No containers being added or removed from the grid.
  • No new WebSphere eXtreme Scale clients trying to connect to the WebSphere eXtreme Scale grid.

It is for this reason that CPU usage on catalog servers has a lower priority for monitoring. You can configure operational alerts to be generated when the user CPU on a catalog server machine spikes to 85% or more.

The container servers, however, are the work horses of any WebSphere eXtreme Scale grid, and the CPU usage of container servers needs to be closely monitored. Although there is not a precise and specific formula, you can generate a medium severity (yellow) operational alert when the user CPU in any WebSphere eXtreme Scale container goes beyond 65%. If the user CPU in a WebSphere eXtreme Scale container climbs to 85% or higher, a high severity (red) operational alert can be generated.

The CPU percentages in WebSphere eXtreme Scale servers can be monitored by nmon or other similar tools. For more information about nmon, see the "nmon performance: A free tool to analyze AIX and Linux performance" developerWorks article.
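The CPU alerting rules above can be sketched as a small Python function that a monitoring script might call for each sample. The function name, return values, and role labels are illustrative; only the thresholds come from this section.

```python
def cpu_alert(role, user_cpu_pct):
    """Map a user-CPU sample (percent) to an alert severity.

    role is "catalog" or "container". Containers: above 65% raises a
    medium-severity (yellow) alert, 85% or more a high-severity (red)
    alert. Catalog servers: only a spike of 85% or more is alerted.
    """
    if role == "catalog":
        return "red" if user_cpu_pct >= 85 else None
    if user_cpu_pct >= 85:
        return "red"
    if user_cpu_pct > 65:
        return "yellow"
    return None
```

A monitoring script would feed this function the user-CPU percentages collected by nmon or a similar tool.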

Memory size

To avoid memory issues, a grid is conservatively sized based on the expected amount of data to be cached. However, even a conservative estimate can prove inadequate in real life. Real-life data access patterns cannot always be accurately predicted beforehand, so the user heap space in each of the container JVMs should be monitored.

Ideally, the heap usage in each container JVM is under 60%. A yellow alert should be generated if any container JVM user heap space goes beyond 60% but stays within 75%. User heap-space usage exceeding 75% on any container JVM should generate a red alert. However, there is no real need to monitor the heap usage on the catalog servers.
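The heap thresholds can likewise be expressed as a small classification helper. This is a sketch with illustrative names; only the percentages come from the guidance above.

```python
def heap_alert(used_bytes, max_bytes):
    """Classify container JVM heap usage per the thresholds above:
    under or at 60% is healthy (None), beyond 60% but within 75%
    is yellow, and exceeding 75% is red."""
    pct = 100.0 * used_bytes / max_bytes
    if pct > 75:
        return "red"
    if pct > 60:
        return "yellow"
    return None
```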

Also, generate verbose garbage collection (verbose:gc) logs on the container JVMs at convenient intervals to get a feel for the average heap space usage, garbage collection frequency, garbage collection pause times, and other relevant information.


Monitoring with the server-side flight recorder

The xsadmin utility (WebSphere eXtreme Scale V7.1 and earlier) and the xscmd utility (WebSphere eXtreme Scale V7.1.1 and later) can be used to collect important grid attributes and statistical data at run time. With either of these two utilities, you can develop and deploy simple scripts as server-side flight recorders for monitoring and logging relevant grid attributes at periodic intervals. Depending on the values of certain monitored grid attributes, the scripts can also generate operational alerts when necessary. The following grid attributes can be monitored by the server-side flight recorder in real-life WebSphere eXtreme Scale DynaCache environments.

Containers and shard distribution

The xscmd -c showPlacement command (or its equivalent xsadmin -containers command) outputs the distribution of shards among the live WebSphere eXtreme Scale container servers. As explained earlier in "Tips and techniques for WebSphere eXtreme Scale DynaCache in WebSphere Commerce environments, Part 2: WebSphere eXtreme Scale grid sizing and configuration," in any WebSphere eXtreme Scale grid, the number of containers and shards is known beforehand. If the total number of containers in the command output is less than the expected number, the grid is operating in an impaired condition with fewer containers than expected. An alert should be generated so that the administrator can investigate and restart the dead containers.
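A server-side flight recorder might implement this check roughly as follows. The parsing is an assumption: it treats any whitespace-separated row whose first field contains the container suffix `_C-` as a placement row; adapt it to the actual `xscmd -c showPlacement` output in your environment.

```python
def check_container_count(placement_lines, expected_containers):
    """Count distinct container names in placement-report lines and
    flag an impaired grid. Assumed line format: whitespace-separated
    fields with the container name (containing "_C-") first."""
    containers = set()
    for line in placement_lines:
        fields = line.split()
        if fields and "_C-" in fields[0]:
            containers.add(fields[0])
    missing = expected_containers - len(containers)
    return {"containers": len(containers),
            "alert": missing > 0,
            "missing": max(missing, 0)}
```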

For grids with replicas, also watch for an approximately uniform distribution of primary and replica shards among the WebSphere eXtreme Scale containers. If the output of the commands shows a shard imbalance, the xscmd -c balanceShardTypes command (or a number of appropriate xsadmin -swapShardWithPrimary commands) can be issued to rebalance the grid.

State of partitions

The xscmd -c routetable command (or its equivalent xsadmin -routetable command) outputs the states of all the shards. The state of a shard can be either reachable or unreachable. Alerts can be generated if the number of partitions does not match the numberOfPartitions attribute that is specified in the grid deployment file, or if one or more shards remain unreachable for some time (half a minute or more). After the alert is generated for an unreachable shard, take three Java™ cores at one-minute intervals from the corresponding container JVM for later analysis, along with other system monitoring data.

NOTE: During internal state transitions, a shard can legitimately be in an unreachable state for a short period of time. It is rare to encounter a shard that remains permanently in an unreachable state. However, if you encounter such a shard, collect all the diagnostic information. Then the owning WebSphere eXtreme Scale containers can be restarted to attempt to resolve the situation. If a container restart fails to address the issue, you have to restart all the containers.
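To implement the half-minute grace period without alerting on those transient unreachable states, a flight recorder can remember when each shard first became unreachable between polls of `xscmd -c routetable`. A minimal sketch; the class and method names are illustrative:

```python
class UnreachableShardTracker:
    """Track how long each shard stays unreachable across polls of
    the route table, and alert only after a grace period (30 s by
    default, matching the half-minute guidance above)."""

    def __init__(self, grace_seconds=30):
        self.grace = grace_seconds
        self.first_seen = {}  # shard id -> time it first went unreachable

    def poll(self, unreachable_shards, now):
        """unreachable_shards: shard ids unreachable in this poll;
        now: current time in seconds. Returns the shard ids that have
        been unreachable for at least the grace period."""
        current = set(unreachable_shards)
        # Forget shards that became reachable again.
        for shard in list(self.first_seen):
            if shard not in current:
                del self.first_seen[shard]
        alerts = []
        for shard in current:
            start = self.first_seen.setdefault(shard, now)
            if now - start >= self.grace:
                alerts.append(shard)
        return sorted(alerts)
```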

Quorum

In some environments, customers might decide to turn on quorum for the catalog servers. Quorum prevents the WebSphere eXtreme Scale grid from splitting into multiple independent grids when there are communication glitches between catalog servers. For more information, refer to the "Catalog server quorums" topic in the WebSphere eXtreme Scale information center, see Resources.

If the quorum flag is set, monitor the output of the xscmd -c showQuorumStatus command (or xsadmin -quorumStatus command). If the quorum is broken, the active server count in the command output is less than the number of catalog server cluster members. If the quorum remains broken for a long time (for example, 5 minutes or more), an alert should be generated for the administrator to take the necessary actions after careful analysis.
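The 5-minute rule can be sketched as a check over successive `xscmd -c showQuorumStatus` samples. The sample format below (timestamp, active server count) is an assumption of the monitoring script, not the command's output format.

```python
def quorum_alert(samples, expected_members, window_seconds=300):
    """samples: chronological (timestamp_seconds, active_server_count)
    pairs from successive quorum-status polls. Returns True if quorum
    has been broken (active < expected) continuously for at least
    window_seconds (300 s = the 5-minute guideline above)."""
    broken_since = None
    for ts, active in samples:
        if active < expected_members:
            if broken_since is None:
                broken_since = ts
            if ts - broken_since >= window_seconds:
                return True
        else:
            broken_since = None  # quorum restored; reset the clock
    return False
```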

Cache size

The xscmd -c showMapSizes command (or its equivalent xsadmin -mapsizes command) outputs important cache statistics. This information includes the number of elements that are cached in each partition, in each container, and in the entire grid, along with the amount of heap space that is occupied by these cached entities. The space calculation includes most of the overhead for storing the cached entities in the WebSphere eXtreme Scale grid. By monitoring the total number of cached entities and their combined size over time, users can obtain excellent information about important operational and sizing attributes, such as the cache growth rate over time, peak cache growth periods, cache growth during special promotion periods, and average object size. Based on this information, e-commerce application architects can fine-tune the application caching strategies.
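For example, a monitoring script that records (timestamp, entry count, total bytes) samples from periodic `xscmd -c showMapSizes` runs can derive the growth rate and average object size like this (a sketch with illustrative names):

```python
def cache_growth_stats(samples):
    """samples: chronological (timestamp_seconds, entry_count,
    total_bytes) tuples from periodic showMapSizes output. Returns the
    cache growth rate in entries per hour over the sampling window and
    the average cached object size (bytes) at the last sample."""
    (t0, n0, _), (t1, n1, b1) = samples[0], samples[-1]
    hours = (t1 - t0) / 3600.0
    rate = (n1 - n0) / hours if hours > 0 else 0.0
    avg_size = b1 / n1 if n1 else 0.0
    return {"entries_per_hour": rate, "avg_object_bytes": avg_size}
```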

WebSphere eXtreme Scale exposes a number of MBeans, and with any available JMX-compliant monitoring tool, you can implement a server-side flight recorder for monitoring, statistics collection, and alert generation. For more information, refer to the "Package com.ibm.websphere.objectgrid.management" topic in the WebSphere eXtreme Scale information center documentation, see Resources.


Monitoring with the client-side flight recorder

Traditional DynaCache can be viewed, monitored, and managed from the GUI interface of the Cache Monitor utility. For WebSphere eXtreme Scale DynaCache, you might use a similar utility, or a special version of the Extended Cache Monitor (ECM) JEE application. Refer to the "IBM Extended Cache Monitor for IBM WebSphere Application Server technology preview" developerWorks article for utility details.

You must not use a wildcard query for WebSphere eXtreme Scale DynaCache using the ECM. Such a query might require a restart of the JVM from where the query was issued. For this reason, it is recommended that you update the cache monitor with the cachemonitor7_package_for_WXS85.zip file from the IBM Extended Cache Monitor web site, see Resources.

NOTE: The name of the zip file is misleading. The update works on any WebSphere eXtreme Scale version, not just WebSphere eXtreme Scale 8.5.

All the update instructions that are detailed in the "IBM Extended Cache Monitor for IBM WebSphere Application Server technology preview" developerWorks article must be properly followed to create a safe version of the ECM that does not issue a wildcard query.

The ECM can be installed in the entire commerce cluster. As a good operational practice, the ECM is used only from one or two designated JVMs in the cluster. With the ECM, you can look at the servlet cache instances. To do so, you have to add the JVM custom property com.ibm.ws.cache.CacheConfig.showObjectContents and set its value to "true" in the JVMs that are used to execute the ECM application.

The WebSphere eXtreme Scale console server (V8.5 or later) provides a powerful facility to query the keys of the cached objects using regular expressions. Based on the results of the query, you can individually invalidate one or more items in the cache. For details, refer to the "Querying and invalidating data" topic in the WebSphere eXtreme Scale information center, see Resources.

The ECM in WebSphere eXtreme Scale DynaCache environments is used to track the rate of least recently used (LRU) evictions from DynaCache. Ideally, the LRU eviction rate is zero or close to it. A consistently high LRU eviction rate can indicate a poorly sized WebSphere eXtreme Scale grid, a poorly designed caching strategy, or both. Based on the LRU eviction rate and the server-side statistics for the number and size of the cached entities, you might have to increase the grid capacity, fine-tune the application caching strategy, or both.

The ECM displays LRU evictions in its GUI. To determine the rate of LRU eviction, you have to watch the LRU evictions field in the ECM GUI. As an alternative, you can deploy a client-side flight recorder to periodically collect DynaCache MBean statistics and place them in a CSV-formatted file. DynaCacheStatisticsCSV.py, the main flight recorder code, can be obtained from the github.com web site, see Resources. For more information, refer to the "All Things WebSphere" blog and the prologue of DynaCacheStatisticsCSV.py for sample usage of the flight recorder script, see Resources. The client-side flight recorder data is useful in debugging and tuning client applications that use DynaCache.

Both the client-side and the server-side flight recorders are used in load test environments and also in production. In a production environment, they are initially executed every 15 minutes. Later, as the system stabilizes, they might be executed at a much longer interval, say every 30 minutes or 1 hour.


Using multi-data-center environments

The use of multiple data centers in active-passive mode for disaster recovery is fairly common in enterprise IT environments. WebSphere Commerce is used in multi-data-center installations, often in active-active mode sharing a back-end database. Each data center typically contains one WebSphere Commerce cell. The WebSphere Commerce application that is deployed in the cells of all the data centers handles user traffic, and the traffic to each data center is workload-managed by a standard IP sprayer.

For such a topology, it is recommended that you create multiple independent WebSphere eXtreme Scale DynaCache grids, one in each data center. Try not to create a single WebSphere eXtreme Scale DynaCache grid spanning multiple data centers. Also, for ease of configuration and maintenance, you can create identical grids of the same name in each data center.


Invalidating cache

WebSphere Commerce provides a wide range of cache invalidation approaches. For more detail on these approaches, refer to the "Cache invalidation" topic on the WebSphere Commerce information center, see Resources.

In the context of a WebSphere Commerce solution with WebSphere eXtreme Scale, we typically use:

  • Command-based invalidation - This type of invalidation is triggered when a specific business-logic command executes in the WebSphere Application Server container and a cached entity is associated with it by the invalidation policies defined in the cache rules file (cachespec.xml).

    Such invalidations are typically triggered by a guest action on the web site, or by a business user changing a business rule in the WebSphere Commerce staging environment (for example, using the Management Center business interface). For details on the staging environment, refer to the "Staging server" topic on the WebSphere Commerce information center, see Resources.

  • Scheduled invalidation - This type of invalidation is used when the data state changes in the WebSphere Commerce database. This is the preferred approach for merchant data changes on the web site, and it follows a pattern that is similar to the publish-subscribe design pattern. The data change and the corresponding dependencyId of the cache entry to be invalidated are published to the WebSphere Commerce CACHEIVL database table, and the subscriber to this event periodically checks the table and performs the appropriate invalidation in the cache provider.

    The key idea behind this approach is that the cached content in the cache provider (WebSphere eXtreme Scale) can be uniquely identified by dependency IDs (referred to as DIDs from here onwards). The same ID, if entered in CACHEIVL, is picked up by the WebSphere Commerce invalidation job, which performs the invalidation in WebSphere eXtreme Scale. This cache invalidation scheduler command typically runs every 10 minutes, reads all the records from the CACHEIVL table, and makes the DynaCache invalidation API call with the specified DIDs. It is highly recommended that you create a specific commerce instance and cluster for performing scheduler-based invalidations.

Figure 1. Invalidation process flow
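The flow in Figure 1 can be sketched as one cycle of the scheduled invalidation job. The two-column table and the `invalidate` callback below are simplified stand-ins for the real CACHEIVL schema and the DynaCache invalidation API.

```python
import sqlite3

def run_invalidation_cycle(conn, invalidate, last_run_time):
    """One pass of the scheduled-invalidation pattern: read the DIDs
    published to CACHEIVL since the last successful run and issue one
    invalidation call per distinct DID."""
    rows = conn.execute(
        "SELECT DISTINCT dataid FROM cacheivl WHERE inserttime > ?",
        (last_run_time,)).fetchall()
    for (did,) in rows:
        invalidate(did)
    return len(rows)

# Demonstration with an in-memory database standing in for CACHEIVL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cacheivl (dataid TEXT, inserttime INTEGER)")
conn.executemany("INSERT INTO cacheivl VALUES (?, ?)",
                 [("ProductPrice:1001", 100),
                  ("ProductPrice:1001", 150),   # duplicate publication
                  ("InventoryStatus:2002", 200)])
invalidated = []
run_invalidation_cycle(conn, invalidated.append, last_run_time=0)
```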

WebSphere Commerce cache invalidation recommendations

A fine-grained invalidation approach is recommended because it evicts only the dirty cache entity and allows the other, unrelated cache entities to remain in the cache. The solution determines the affected cache-entity dependencyIDs (DIDs) and uses them to perform precise invalidation of the cached content, rather than clearing all the cached entities. An invalidation event is neither idempotent nor cheap; you need to ensure that during a transaction, only one invalidation event occurs for a given dependencyID. You also need to implement the cache invalidation process based on the dynamic cache invalidation approach of the WebSphere Commerce scheduler framework. This framework invalidates specific cache entities by comparing the CACHEIVL database table's DATAID value to the cache entity's dependencyID value. For more information on the cache invalidation framework in WebSphere Commerce, refer to the "DynaCacheInvalidation URL" topic on the WebSphere Commerce information center, see Resources.
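Because an invalidation event is not cheap, a simple guard is to collapse duplicate DIDs raised within one transaction before calling the invalidation API. A minimal sketch, with an illustrative function name:

```python
def dedup_invalidation_ids(dependency_ids):
    """Collapse duplicate dependencyIDs raised during one transaction
    so that each DID triggers at most one invalidation call, preserving
    the order in which the DIDs first appeared."""
    seen = set()
    unique = []
    for did in dependency_ids:
        if did not in seen:
            seen.add(did)
            unique.append(did)
    return unique
```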

Command invalidation is implemented with one of the following options to avoid duplicate invalidation messages.

  • Mention only one command in the cachespec.xml for issuing the invalidation call per unique transaction.
  • Modify the WebSphere Commerce command class to issue the programmatic invalidation when the specific user transaction is encountered. This is best implemented by extending the processExecute method of the concerned command (for example, PromotionEngineOrderCalculateCmdImpl) and then issuing the invalidation towards the end of the method body. If you use this approach, you need to set the delayed invalidation attribute to true in the invalidation API.

In Listing 1 and Listing 2, the system is expected to update the mini-cart total based on the user action. In the correct approach of Listing 2, only one command is mentioned in cachespec.xml; it issues the invalidation for all the possible user interactions, which avoids unnecessary invalidation traffic.

Listing 1. Incorrect command invalidation approach
<!-- ******************************************************* -->
<!-- Start Invalidation rules for Shopping Totals. -->
<!-- ******************************************************* -->
<!-- We add this rule so the Shopping Totals will always be -->
<!-- correct when users modify their cart. This is an       -->
<!-- invalidation policy that is triggered by a shopper's   -->
<!-- action.                                                -->
<!-- ******************************************************* -->
<cache-entry>
<class>command</class>
<sharing-policy>not-shared</sharing-policy>
<name>com.ibm.commerce.order.commands.OrderCalculateCmdImpl</name>
<name>com.ibm.commerce.order.commands.PromotionEngineOrderCalculateCmdImpl</name>
<name>com.ibm.commerce.orderitems.commands.OrderItemMoveCmdImpl</name>
<name>com.ibm.commerce.usermanagement.commands.UserRegistrationAddCmdImpl</name>
<name>com.ibm.commerce.usermanagement.commands.UserRegistrationUpdateCmdImpl</name>
<name>com.ibm.commerce.order.commands.OrderProcessCmdImpl</name>
<name>com.ibm.commerce.orderitems.commands.OrderItemAddCmdImpl</name>
<name>com.ibm.commerce.orderitems.commands.OrderItemDeleteCmdImpl</name>
<name>com.ibm.commerce.orderitems.commands.OrderItemUpdateCmdImpl</name>
<name>com.ibm.commerce.orderitems.commands.ExternalOrderItemAddCmdImpl</name>
<name>com.ibm.commerce.orderitems.commands.ExternalOrderItemUpdateCmdImpl</name>
<name>com.ibm.commerce.order.commands.OrderCancelCmdImpl</name>
<name>com.ibm.commerce.order.commands.SetPendingOrderCmdImpl</name>

<invalidation>MiniCart:DC_storeId:DC_userId
<component type="method" id="getCommandContext">
<method>getStoreId</method>
<required>true</required>
</component>
<component type="method" id="getCommandContext">
<method>getUserId</method>
<required>true</required>
</component>
</invalidation>
</cache-entry>
Listing 2. Correct command invalidation approach
<!-- ******************************************************* -->
<!-- Start Invalidation rules for Shopping Totals. -->
<!-- ******************************************************* -->
<!-- We add this rule so the Shopping Totals will always be -->
<!-- correct when users modify their cart. This is an       -->
<!-- invalidation policy that is triggered by a shopper's   -->
<!-- action.                                                -->
<!-- ******************************************************* -->
<cache-entry>
<class>command</class>
<sharing-policy>not-shared</sharing-policy>
<name>com.ibm.commerce.order.commands.PromotionEngineOrderCalculateCmdImpl</name>

<invalidation>MiniCart:DC_storeId:DC_userId
<component type="method" id="getCommandContext">
<method>getStoreId</method>
<required>true</required>
</component>
<component type="method" id="getCommandContext">
<method>getUserId</method>
<required>true</required>
</component>
</invalidation>
</cache-entry>

Similarly, when you use the scheduled invalidation approach with the CACHEIVL table, you need to avoid duplicate dependencyIDs between the last successful run and the next scheduled run of the invalidation job.

One of the biggest challenges in the scheduled invalidation implementation is to detect the change event without intrusive performance implications, and then generate the dependencyID of the cache entities that are affected by the change. Another important challenge is that the software that makes the merchant data-change (like catalog, price, and inventory) typically does not allow programmatic hooks for making invalidation API calls. The best invalidation strategy is to implement dependency-based invalidation. This option has the advantage of being able to capture change events with less performance impact. You implement dependency-based invalidation with the following process:

  1. Develop a program to generate the dependencyIDs of all the cache entities that depend on a specific database table and its data.
  2. Customize the data load process to make it intelligent enough to derive the dependencyID based on the data that is modified in the database, and then insert the dependencyID of the cache entry to be invalidated in the CACHEIVL table.
  3. Stage-Prop the invalidation dependencyIDs from the staging server. For details on the Stage-Prop process, see Resources.
  4. Create selective database triggers in the production database to invalidate data that is updated directly on the production system, for example, through an emergency update process.
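Steps 1 and 2 above can be sketched as follows. The table names, column name, and DID formats are hypothetical illustrations, not WebSphere Commerce definitions; a real implementation mirrors the dependency IDs defined in the site's cachespec.xml.

```python
def dependency_ids_for_change(table, row):
    """Derive the DIDs affected by a data change (step 1). The mapping
    below is illustrative only; a real implementation encodes the
    site's own cachespec.xml dependencies."""
    catentry = row["catentry_id"]
    if table == "OFFERPRICE":
        return ["ProductPrice:%s" % catentry,
                "ProductDisplay:%s" % catentry]
    if table == "INVENTORY":
        return ["InventoryStatus:%s" % catentry]
    return []

def cacheivl_inserts(changes):
    """Turn a batch of data-load changes into the DIDs to insert into
    CACHEIVL (step 2), de-duplicating DIDs within the batch."""
    dids = []
    for table, row in changes:
        for did in dependency_ids_for_change(table, row):
            if did not in dids:
                dids.append(did)
    return dids
```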

Set the timeout and inactivity elements of cachespec.xml to a large value. They are not to be used as a mechanism for implementing normal cache invalidation rules. Instead, implement time-based invalidation by creating a separate entry in the SCHCONFIG table for the DynaCacheInvalidation command and configuring it to execute at a specific time of the day.

Based on our production implementation experiences, it is also highly recommended that you avoid using the ActivityCleanUpCmd scheduled job and instead use the WebSphere Commerce database object purge utility (dbclean) as the approach for purging business-context data from the database. The challenges of using ActivityCleanUpCmd and the approach for using the dbclean utility are discussed on the IBM Support web site, see Resources.


Conclusion

This final installment of the series provides ideas, insights, and best practices for successfully monitoring a WebSphere eXtreme Scale DynaCache grid, collecting statistical data, and invalidating cached data. Although the series focuses on WebSphere Commerce environments, most of the techniques are equally applicable to the general domain of WebSphere eXtreme Scale DynaCache. WebSphere eXtreme Scale is a quickly evolving technology. As additional best practices and tools emerge, the authors intend to provide updates to the WebSphere eXtreme Scale DynaCache usage series.


Acknowledgments

The authors are grateful to the following people for technical discussions during the course of their work with WebSphere Commerce and WebSphere eXtreme Scale: Kyle Brown, IBM DE, ISSW; Brian Martin, STSM, WebSphere eXtreme Scale and XC10 Lead Architect; Douglas Berg, WebSphere eXtreme Scale Architect; Chris Johnson, WebSphere eXtreme Scale Architect; Jared Anderson, WebSphere eXtreme Scale Architect; Rohit Kelapure, WebSphere Application Server Development; Joseph Mayo, XC10 Development; Surya Duggirala, WebSphere Application Server Performance Lead; Matt Kilner, JDK L3; Brian Thomson, STSM, WebSphere Commerce Server CTO; Misha Genkin, WebSphere Commerce Server Performance Architect; Robert Dunn, WebSphere Commerce Server Development; Kevin Yu, ISS-IS. The authors would like to acknowledge Mary A. Brooks for a superb job in copy editing. A very special thanks to Cheenar Banerjee for her assistance in proofreading and suggesting several readability improvements.

Resources


