Resolve an Outage in 10 Minutes or Less
No, the title is not a marketing scam! The Commerce team recently released a set of tools that can be used to validate configuration and performance. By running reports on data collected during an outage, you can gain insight into the state of each layer, which will often help you uncover the cause of the outage in a matter of minutes. In this post I will not cover the tools in detail, but I will show you an example of how they can be used to resolve an actual site disruption.
The WebSphere Commerce tool catalog
You can use this URL to access all the tools discussed in this post:
You'll be presented with the catalog from which each individual tool can be opened:
Scenario: Troubleshooting a site disruption
The best way to show you how the tools can help diagnose an outage is with an example.
Using the WebSphere Commerce IHS Report
Let's start with the WebSphere Commerce IHS Report. This report reads mpmstats output to create graphs that will help you understand the timeline and severity of an outage.
The mpmstats - Busy WebSphere Plug-in Connections chart shows the number of active connections inside the WebSphere plug-in. This typically maps to the number of web server connections waiting on the Application Servers to respond. Here we can see a big jump in activity from 8:00 to 8:15 PM. This confirms the time of the event and the fact that the web servers are waiting on the application servers to respond.
When the SlowThreshold option is enabled, mpmstats also reports the number of connections that have been active for longer than the specified threshold, in seconds. Reviewing the mpmstats - Long Running WebSphere Plug-in Connections chart helps confirm that not only are there many connections waiting on the application servers, but also that these requests are not being answered in a timely manner.
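If you want to eyeball the same busy-connection numbers without the report, here is a minimal sketch of pulling them out of error_log yourself. The two line formats in the patterns, the module name, and the sample log lines are assumptions made for illustration, so check them against your own log before relying on this. The long-running counts are left out because their log format is not shown in this post.

```python
import re

# Assumed shape of an mpmstats summary line; adjust to match your error_log.
SUMMARY = re.compile(
    r"mpmstats: rdy (\d+) bsy (\d+) rd (\d+) wr (\d+) ka (\d+) log (\d+) dns (\d+) cls (\d+)"
)
# Assumed shape of the per-module line written when module tracking is on.
MODULE = re.compile(r"mpmstats: bsy: (\d+) in (\S+)")

KEYS = ("rdy", "bsy", "rd", "wr", "ka", "log", "dns", "cls")


def parse_mpmstats(lines):
    """Return (summaries, modules) extracted from error_log lines."""
    summaries = []  # one dict of counters per summary line
    modules = []    # (module name, busy count) pairs
    for line in lines:
        match = SUMMARY.search(line)
        if match:
            summaries.append(dict(zip(KEYS, map(int, match.groups()))))
            continue
        match = MODULE.search(line)
        if match:
            modules.append((match.group(2), int(match.group(1))))
    return summaries, modules


# Made-up sample lines, not taken from the outage described in this post.
sample = [
    "[Mon Jan 01 20:05:00 2018] [notice] mpmstats: rdy 88 bsy 312 rd 1 wr 310 ka 0 log 0 dns 0 cls 1",
    "[Mon Jan 01 20:05:00 2018] [notice] mpmstats: bsy: 305 in mod_was_ap22_http.c",
]
summaries, modules = parse_mpmstats(sample)
print(summaries[0]["bsy"], modules)
```

In the sample, almost all busy connections are attributed to the plug-in module, which is the pattern the Busy WebSphere Plug-in Connections chart makes visible over time.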
After reviewing the IHS Report we find the following:
The next step is to see what was happening inside the Commerce servers to understand why they might not be responding.
Using the WebSphere Commerce Health Center Report tool
As the web servers are pointing to a slowness in the Commerce servers, the next step is to use the WebSphere Commerce Health Center Report tool to analyze the activity in these servers.
This report uses data generated by the Java Health Center agent. The agent is installed by default, but depending on your version, you might need to update it. Once enabled, the agent will continuously log performance data such as Garbage Collection, CPU, Native Memory, Threads (every 30 seconds), and more. See the installation docs for more details.
To keep things short, we will skip the CPU and GC charts that I typically review and jump directly into the WebContainer Base Activity report. This report shows WebContainer thread activity. Thread data is collected by the Health Center agent every 30 seconds, and the summary is displayed in the form of a chart.
When reviewing the chart, you can easily see that the WebContainer pool was very busy from 8:00 to 8:13, and that most threads were waiting on DB2. Some of the threads are also flagged as hung. These are threads whose stack hasn't changed since the previous sample, taken 30 seconds earlier.
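The "hung" rule is simple enough to sketch. The following is a rough illustration of comparing two consecutive samples, not the report's actual implementation. The data layout, thread names, and stack frames are all hypothetical.

```python
def find_hung_threads(previous, current):
    """Return names of threads whose stack is identical in two consecutive samples.

    Each sample maps a thread name to a tuple of stack frames (top frame first).
    Threads that only appear in one of the samples are ignored.
    """
    return sorted(
        name
        for name, stack in current.items()
        if name in previous and previous[name] == stack
    )


# Hypothetical samples taken 30 seconds apart; frame names are placeholders.
sample_2000 = {
    "WebContainer : 1": ("socketRead", "executeQuery", "composePrice"),
    "WebContainer : 2": ("socketRead", "executeQuery", "composePrice"),
}
sample_2030 = {
    "WebContainer : 1": ("socketRead", "executeQuery", "composePrice"),
    "WebContainer : 2": ("render", "service"),
    "WebContainer : 3": ("socketRead", "executeQuery", "composePrice"),
}
hung = find_hung_threads(sample_2000, sample_2030)
print(hung)
```

Note that this naive comparison would also flag idle threads parked in the pool, so a real implementation would need to exclude those.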
The report also shows a detailed activity view. While the Base Activity chart only shows the last operation being done, the Detailed Activity chart shows the complete "chain" of activities for each thread.
From here, the original (Javacore-like) stack for each thread can also be accessed. Simply select a thread/time and a popup will display the stack.
The stack shows us that the SQLs are coming from: CatalogEntryGraphComposer.composeCatalogEntryOfferPriceForCurrency().
At this point we know the following:
Next step: Everybody is blaming the database, so let's go talk to it.
Using the WebSphere Commerce DB2 Report tool
The WebSphere Commerce DB2 Report tool uses a .zip created by db2collect.sh to create a report of the database configuration and performance. The data in the report is extensive. It includes instance and database configuration, schema, lock analysis, performance, and more. Here, we'll focus on only a few tabs.
The first window presented is the "Findings and Recommendations" tab for the report. In this case, the items that stand out are the cur_commit setting and the number of lock waits.
One of the first charts to look at is "Agent State". The equation is simple: Many threads executing or locked means there is a problem. In this case, it can be clearly seen that from 20:05 to 20:11 there was very heavy lock contention in the database.
This table shows a summary of the lock-waits at every snapshot time (using sysibmadm.mon_lockwaits). The data can be read as follows:
This matches the previous finding that the threads in Commerce were executing SQLs from composeCatalogEntryOfferPriceForCurrency().
The DB2 report shows the following:
Through this data, we figured out that there is a new price update job that was configured to run daily at 8:00 PM. While the job runs, SQLs that read price data are hanging.
Remember that the original Findings and Recommendations tab mentioned that cur_commit was not turned on?
cur_commit works as follows: If a row is being updated (X lock) and a second connection is doing a CS scan, instead of having to wait for the connection holding the lock to commit or roll back, DB2 will return the "currently committed" value, which is the value before it was updated by the connection holding the lock. The connection doing the scan then doesn't need to wait. This setting is on by default for new databases, but for backward compatibility it remains disabled on migrated databases.
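To make the difference concrete, here is a toy model of the behavior just described. It is only an illustration of the semantics, not how DB2 implements locking, and the class, function, and price values are invented for the example.

```python
class Row:
    """Toy row: a committed value plus an optional uncommitted update."""

    def __init__(self, value):
        self.committed = value
        self.pending = None  # set while a writer holds the X lock

    def update(self, value):
        # The writer changes the row but has not committed yet.
        self.pending = value

    def commit(self):
        if self.pending is not None:
            self.committed = self.pending
            self.pending = None


def cs_read(row, cur_commit):
    """Model a CS read: return ("value", v), or ("lock-wait", None) if it would block."""
    if row.pending is None or cur_commit:
        # No writer in the way, or the reader is given the last committed value.
        return ("value", row.committed)
    return ("lock-wait", None)


price = Row(9.99)
price.update(7.99)                       # price update job, not yet committed

without_cc = cs_read(price, cur_commit=False)
with_cc = cs_read(price, cur_commit=True)
price.commit()
after_commit = cs_read(price, cur_commit=False)
print(without_cc, with_cc, after_commit)
```

With the setting off, the reader waits behind the price update job. With it on, the reader gets the old price immediately and only sees the new one after the job commits.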
The next recommendation is to implement a transaction size in the utility. Rather than committing all the changes once at the end of processing, committing every n records releases the locks sooner.
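A minimal sketch of the commit-every-n-records idea follows. It uses Python's built-in sqlite3 module and an invented prices table as stand-ins so that the example runs anywhere. The real utility and the Commerce schema will look different.

```python
import sqlite3


def update_prices(conn, updates, batch_size):
    """Apply (price, item_id) updates, committing after every batch_size rows.

    Returns the number of commits issued.
    """
    commits = 0
    pending = 0
    for price, item_id in updates:
        conn.execute(
            "UPDATE prices SET price = ? WHERE item_id = ?", (price, item_id)
        )
        pending += 1
        if pending == batch_size:
            conn.commit()  # locks held by this batch are released here
            commits += 1
            pending = 0
    if pending:
        conn.commit()      # commit the final partial batch
        commits += 1
    return commits


# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (item_id INTEGER PRIMARY KEY, price REAL)")
conn.executemany("INSERT INTO prices VALUES (?, ?)", [(i, 10.0) for i in range(1, 11)])
conn.commit()

commits = update_prices(conn, [(9.5, i) for i in range(1, 11)], batch_size=4)
print(commits)
```

The trade-off is that a failure partway through leaves the earlier batches committed, so the job needs to be safe to rerun or resume.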
Your next steps
If you haven't done so yet, I recommend you check out the demo reports. They are the easiest way to get familiar with the tools. You can find links to the demo reports from each tool's Help tab.
For the IHS Report, mpmstats logging into error_log is enabled by default, but you may want to tweak the configuration to log more frequently and report long-running requests (see the IHS - Installation document).
If you are running a recent WebSphere Application Server version, you only need to enable the Health Center agent. Otherwise, you need to upgrade its libraries (see the HC - Installation document). Once enabled and configured, the Health Center agent will continuously log performance data (threads, GC, CPU, etc.) into hcd files.
Although the tools are provided as-is, if you run into issues you can post your questions/comments in the forums (you can also find the links in the Help tab).
With all the tools set up, you will be in a much better position to tackle any future site issues. Good luck!