
Solving performance degradation problems in WebSphere applications

Ting Lou (louting@cn.ibm.com), Software Engineer, IBM
Ting Lou is a Software Engineer at the IBM China Software Development Lab, Beijing, China. He works on IBM WebSphere Commerce system testing.
James Tang (mfjtang@ca.ibm.com), Software Developer, IBM
James Tang is a Software Developer at the IBM Toronto Lab, Ontario, Canada. He has been a lead developer of the cache component of WebSphere Commerce. He is currently working as an Advisory I/T Specialist for IBM Software Services for WebSphere.

Summary:  Learn how to diagnose WebSphere® Commerce throughput problems during system verification testing (SVT) and how to solve them to improve performance.

Date:  20 Jun 2007
Level:  Intermediate

Introduction

WebSphere Commerce is a leading e-commerce product and a complex Web application that runs on top of WebSphere Application Server. For WebSphere Commerce developers and SVT testers, troubleshooting performance problems is one of the most important tasks. This article focuses on analyzing throughput degradation problems in WebSphere applications. It describes a general methodology, drawn from our work experience, for diagnosing throughput degradation found in SVT, and suggests how to solve these problems to improve performance. It contains three main sections:

  • How to identify throughput degradation in SVT: This section introduces the main indicators of throughput degradation found in SVT.
  • How to analyze and solve throughput degradation: This section introduces a general methodology on how to deal with throughput degradation, explains the detailed working process of analyzing throughput degradation, and provides possible solutions.
  • Example of throughput degradation analysis and solution: This section takes you through a throughput degradation example from our actual work to demonstrate this methodology.

Throughput problem determination

In SVT, testers may encounter throughput degradation problems that seriously affect an application's performance. Identifying these problems, analyzing them, and finding the corresponding solution is an important task for WebSphere application developers and testers. Solving them can improve the performance of the WebSphere application significantly.

Identifying throughput degradation in SVT

This section introduces the main characteristic of throughput degradation so that you can identify problems as early as possible in SVT for WebSphere applications. Gradual throughput degradation is a systemic problem that you may encounter in SVT. Its main symptom is that during the test interval (for example, 3 hours), throughput decreases gradually while the response time of the WebSphere application grows longer and longer. This trend is easy to spot in the report of the stress test tool. Figure 1 shows typical transaction, throughput/concurrency, and response time curves.


Figure 1. Throughput degradation problems in the stress test tool report

Analyzing and solving throughput degradation

This section summarizes our throughput analysis methodology, and then gives a detailed working process on how to analyze and solve the throughput degradation problems.

Throughput analysis methodology

In our experience, the main causes of throughput degradation are code issues, database issues, and test data or method issues. We can use our test report to identify throughput degradation easily, but finding its root cause requires thorough investigation.

Here we summarize a throughput analysis methodology based on our experience with WebSphere Commerce. You can use it to analyze two kinds of problems: throughput that is lower than the target, and gradual throughput degradation. This article focuses on the latter. Figure 2 outlines the methodology.

Here is a brief explanation of our methodology:

  • The methodology does not cover all database problems. Generally, database problems can refer to any bottleneck due to database, including SQL queries, database tuning parameters, indexing issues, and data distribution problems.
  • The flow chart assumes that a performance baseline has already been established. If a test case was run in a previous release, we usually use its previous result as the throughput baseline. If the case is new, we usually use its final test result in the new release as the baseline for future comparison.
  • In Figure 2, "Divide & conquer" means running with smaller scenarios, such as homepage only, logon only, and browse only, rather than an end-to-end run.
  • Cost/SQL is the total execution cost of an SQL query divided by its number of executions. Fetch time is not included in the execution cost.
  • Access plans can change based on accumulated data. You can use DB2® Explain utilities to find more clues.


Figure 2. Throughput analysis methodology

For a detailed description of each step, see the section below, "Throughput degradation analysis and solution".

Throughput degradation analysis and solution

When you encounter a throughput problem in SVT, follow these steps to analyze and solve it. First, identify whether the problem is a pure low throughput problem or a throughput degradation problem. To do so, check whether the Throughput/Concurrency charts in the stress test report trend downward during the test. If they do, it is a gradual throughput degradation problem, as shown in Figure 1. Otherwise, it is a pure low throughput problem.
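The check described above can be approximated numerically when the chart is hard to read by eye. The following Python sketch fits a least-squares line to throughput samples taken at equal intervals and classifies the run. The function names, the 1% relative-slope tolerance, and the sample values are illustrative assumptions, not part of the stress test tool.

```python
from statistics import mean


def throughput_slope(samples):
    """Least-squares slope of throughput over equally spaced intervals."""
    if len(samples) < 2:
        raise ValueError("need at least two throughput samples")
    xs = range(len(samples))
    x_bar, y_bar = mean(xs), mean(samples)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, samples))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


def classify(samples, target, tolerance=0.01):
    """Classify a run; tolerance is an assumed relative slope per interval."""
    relative_slope = throughput_slope(samples) / mean(samples)
    if relative_slope < -tolerance:
        return "gradual degradation"
    if mean(samples) < target:
        return "pure low throughput"
    return "ok"


# Made-up samples (transactions per interval) for illustration only.
print(classify([1000, 950, 900, 850], target=900))
print(classify([800, 805, 795, 800], target=900))
```

A relative slope is used rather than an absolute one so that the same tolerance can be applied to test cases with very different throughput levels.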

You can start analyzing the problem using this detailed process:

  1. Check to see whether all WebSphere application commands degrade when compared against the baseline result. You can do this by checking the average response time of all the commands in our test report. If only certain commands are slow, it usually means a design problem or code issue, and you can use Jinsight or an equivalent Java™ profiler to pinpoint the culprit in the code.
  2. Check to see whether throughput degradation exists after restarting WebSphere Application Server. If the problem is resolved after a server restart, it probably relates to a non-database specific problem. The most likely type is a memory problem, such as memory leak, heap fragmentation, or large object allocation. In some cases, the problem is caused by WebSphere Application Server latent issues and you need to involve its service team. However, if you cannot solve the problem by restarting the server, go to Step 3 to continue the analysis.
  3. Check to see whether the database has been optimized. For DB2, check whether runstats has been run on the DB server. If not, run it. Runstats is important for DB2 performance when the data volume is large or the system has been running for a long time, because it refreshes the statistics that DB2 uses to choose an efficient access plan. For Oracle®, you can optimize database performance with the following command: execute dbms_utility.analyze_schema ('schema_name', 'COMPUTE'). This article mainly uses DB2 and runstats as its example.
  4. Check to see whether DB tuning has been done. If not, tune the DB2 parameters. Typical tuning targets include bufferpools, sortheap, and locks. If the problem persists after DB tuning, there are two cases:
    • If the problem is not a gradual degradation, examine the DB2 snapshot file to analyze the top SQL queries (the number of executions and the cost of each query) and find which query is causing the problem.
    • If the problem is a gradual degradation (the focus of this article), go to Step 5 to continue the analysis.
  5. Check to see whether the data is evenly distributed. If not, fix the data problem. Unevenly distributed data is created by improper warmup or unbalanced operations during the test. For example, if the tester uses a few fixed users to place orders in the warmup, thousands of orders related to these users are created in the database. Similarly, if the tester uses only a few fixed users to place orders in the formal test, the corresponding data accumulates. This can make DB2 queries use table scans instead of index scans, which degrades database performance. A better method is to exclude the warmup users from the formal test and to select users randomly from a larger user population. For example, select 20 users randomly from 400 users for both the warmup and the formal test.
  6. For divide and conquer, try to use the smallest scenarios to reproduce the problem. This can isolate the scale of the possible causes of throughput degradation. To accomplish this, divide the test scenarios into different groups and test separately, find the groups that caused throughput degradation, then divide those groups again and again until you narrow down to the minimum scenario group that caused the problem.
  7. Run the scenarios confirmed in Step 6 for a long time and take multiple snapshots during the test. For example, take one 10-hour snapshot on the first day and another on the second day, and then compare the two.
  8. Compare the snapshots to see whether the Cost/SQL and the number of executions of any SQL queries are growing. If so, purge the data and tune the index if needed. Comparing the snapshots lets you identify the queries whose cost grows the most. The cost can be any of the following metrics: execution time per query execution, user CPU time per query execution, or system CPU time per query execution. Other costs, such as "fetch time", are not included in the execution time reported by the snapshot. Be sure to look up the same SQL query in each snapshot to find what is "growing".
  9. If the Cost/SQL entries are all constant, compare the snapshots to see whether rows read/execution is growing for some SQL queries. If so, try to tune the corresponding index to improve the performance. If not, analyze the access plan (for example, using DB2 Explain utilities) to see which can be amended.
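The order of the checks in Steps 1 to 9 can be summarized as a small decision function. This Python sketch is only a reading aid for the process above. The `Findings` fields and the returned strings are hypothetical names chosen for illustration, and the function does not replace the judgment each step requires.

```python
from dataclasses import dataclass


@dataclass
class Findings:
    """Answers collected while working through the steps (hypothetical names)."""
    only_some_commands_slow: bool
    fixed_by_restart: bool
    runstats_done: bool
    db_tuned: bool
    gradual: bool
    data_evenly_distributed: bool


def next_action(f: Findings) -> str:
    """Return the next thing to look at, following the order of the steps."""
    if f.only_some_commands_slow:
        return "profile the slow commands (Step 1)"
    if f.fixed_by_restart:
        return "investigate memory or application server issues (Step 2)"
    if not f.runstats_done:
        return "run runstats (Step 3)"
    if not f.db_tuned:
        return "tune DB2 parameters (Step 4)"
    if not f.gradual:
        return "examine top SQL in a DB2 snapshot (Step 4)"
    if not f.data_evenly_distributed:
        return "fix test data distribution (Step 5)"
    return "divide and conquer, then compare snapshots (Steps 6-9)"


# Example: every command is slow, a restart does not help, the database is
# tuned, the degradation is gradual, but the test data is skewed.
print(next_action(Findings(False, False, True, True, True, False)))
```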

In Steps 8 and 9, you identify the top queries that have growing Cost/SQL or growing rows read per execution. These two characteristics are usually the main indicators of performance degradation. To solve these problems, drop the extraneous index, add a new index, or periodically clean out the large volume of obsolete data in some tables. If no SQL queries in the snapshots have growing costs, analyze the access plan to see what can be amended based on the accumulated data. Figure 3 is an example that shows rows read per execution for each SQL query, obtained by comparing two snapshots from a 4-day test: a 10-hour snapshot taken on Day 2 and a 10-hour snapshot taken on Day 4. The numbers in red mark queries that have growing costs and need to be amended.


Figure 3. Example of identifying growing cost SQL queries
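The snapshot comparison in Steps 8 and 9 can be sketched in Python as follows. The sketch assumes that the per-query counters (number of executions, total execution time, rows read) have already been extracted from two DB2 snapshots into dictionaries keyed by SQL text. The extraction itself, the `QueryStats` type, the 1.2 growth threshold, and the sample numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class QueryStats:
    executions: int      # number of executions in the snapshot
    exec_time_s: float   # total execution time in seconds
    rows_read: int       # total rows read


def ratio(after, before):
    """Growth factor from before to after, tolerating a zero baseline."""
    if before > 0:
        return after / before
    return float("inf") if after > 0 else 1.0


def growing_queries(early, late, threshold=1.2):
    """Return (sql, cost_ratio, rows_ratio) for queries whose per-execution
    cost or rows read grew by at least the threshold factor."""
    flagged = []
    for sql, before in early.items():
        after = late.get(sql)
        # Compare only queries that were executed in both snapshots.
        if after is None or before.executions == 0 or after.executions == 0:
            continue
        cost_ratio = ratio(after.exec_time_s / after.executions,
                           before.exec_time_s / before.executions)
        rows_ratio = ratio(after.rows_read / after.executions,
                           before.rows_read / before.executions)
        if cost_ratio >= threshold or rows_ratio >= threshold:
            flagged.append((sql, cost_ratio, rows_ratio))
    return sorted(flagged, key=lambda item: item[2], reverse=True)


# Made-up numbers: "query A" reads three times as many rows per execution
# in the later snapshot, while "query B" is unchanged.
day2 = {"query A": QueryStats(1000, 2.0, 50_000),
        "query B": QueryStats(500, 1.0, 500)}
day4 = {"query A": QueryStats(1000, 2.1, 150_000),
        "query B": QueryStats(500, 1.0, 500)}

for sql, cost_ratio, rows_ratio in growing_queries(day2, day4):
    print(f"{sql}: cost x{cost_ratio:.2f}, rows read x{rows_ratio:.2f}")
```

Keying the dictionaries by SQL text reflects the requirement in Step 8 to compare the same query across snapshots. Queries that appear in only one snapshot are skipped rather than guessed at.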

Example of throughput degradation analysis in WebSphere Commerce SVT

This section walks through an example of throughput degradation that we encountered in WebSphere Commerce SVT. The test was run on an AIX platform in a 1-tier test environment, with DB2, IBM HTTP Server, and WebSphere Application Server installed on the same machine.

This throughput degradation problem occurred in a 3-hour stress comparison test. Based on the test result from the previous release, we set the throughput baseline at 11580 transactions per hour. We estimated a 10% performance improvement in the new release, so the test target for this case was set at 12738 transactions per hour.

Finding throughput degradation

In the first run, we checked the report of our stress test tool. The throughput curve looked fairly straight and the degradation was not obvious at first. The rate of degradation was around 5% per hour. However, at such a rate, within a few days the application would no longer be able to handle any load with reasonable response times. Figure 4 shows the throughput chart in the stress test report.


Figure 4. Find throughput degradation in stress test tool report
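The numbers in this example can be checked with a few lines of arithmetic. The Python sketch below recomputes the test target from the baseline and projects how quickly a degradation of about 5% per hour erodes throughput. It assumes the loss compounds each hour, which the stress test report does not state, so treat the projection as a rough illustration only.

```python
def target_throughput(baseline, improvement):
    """Baseline raised by the expected improvement, rounded to whole transactions."""
    return round(baseline * (1 + improvement))


def remaining_fraction(rate_per_hour, hours):
    """Fraction of the initial throughput left, assuming compounding loss."""
    return (1 - rate_per_hour) ** hours


def hours_until(fraction, rate_per_hour):
    """Smallest whole number of hours after which throughput is at or below
    the given fraction of its initial value."""
    if not 0 < rate_per_hour < 1 or not 0 < fraction <= 1:
        raise ValueError("rate must be in (0, 1) and fraction in (0, 1]")
    hours, level = 0, 1.0
    while level > fraction:
        level *= 1 - rate_per_hour
        hours += 1
    return hours


print(target_throughput(11580, 0.10))             # test target for this case
print(f"{remaining_fraction(0.05, 72):.3f}")      # fraction left after 3 days
print(hours_until(0.5, 0.05))                     # hours until throughput halves
```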

Analyzing and solving throughput degradation

This was a gradual throughput degradation problem. We followed Steps 1 to 3 to analyze the problem and got the following results:

  • All WebSphere Commerce commands were degraded, especially commands corresponding to orders.
  • When analyzing native_stderr.log, we found that the GC cycles were fine and that there were no memory issues such as a memory leak, heap fragmentation, or large object allocation.
  • After restarting the server, the throughput degradation still existed.
  • Runstats and common DB2 tuning had been done.

In Step 4, we doubled SORTHEAP to 2048 to avoid sort overflows because some queries needed to create temporary tables.

In Step 5, when checking the database we found some data issues. For example, after running this query, select member_id, count(*) from orders group by member_id having count(*) > 50, we got the result shown in Figure 5.


Figure 5. Data issue found in the database
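To show what the data-distribution query in Step 5 reports, the following Python sketch runs the same GROUP BY / HAVING query against a small in-memory SQLite table. The two-column `orders` table and the generated rows are toy stand-ins invented for this illustration. They are not the WebSphere Commerce schema, and SQLite stands in for DB2 only so that the sketch is self-contained.

```python
import sqlite3


def heavy_users(conn, min_orders=50):
    """Return (member_id, order_count) for users with more than min_orders orders."""
    cur = conn.execute(
        "SELECT member_id, COUNT(*) AS n FROM orders "
        "GROUP BY member_id HAVING COUNT(*) > ? "
        "ORDER BY n DESC, member_id",
        (min_orders,),
    )
    return cur.fetchall()


def demo_connection():
    """Build toy data: 4 users with 60 orders each and 20 users with 3 each."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (orders_id INTEGER PRIMARY KEY, member_id INTEGER)"
    )
    rows = [(m,) for m in (1, 2, 3, 4) for _ in range(60)]
    rows += [(m,) for m in range(5, 25) for _ in range(3)]
    conn.executemany("INSERT INTO orders (member_id) VALUES (?)", rows)
    return conn


conn = demo_connection()
for member_id, count in heavy_users(conn):
    print(member_id, count)
```

With this toy data, only the four users who each placed 60 orders pass the threshold, which mirrors the kind of skew the query is meant to expose.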

The data in Figure 5 indicates that four users had more orders than the other users. After reviewing our test steps, we found the following causes:

  • In the warmup, we often used 4 fixed users to run scenarios.
  • There were 200 users in the database, but only 20 users were selected to run scenarios in the formal test.

These two factors caused the data to be distributed unevenly in the database, so our solution had two parts:

  • Increase the number of users from 200 to 400.
  • In both the warmup and the formal test, randomly select each virtual user from 20 non-overlapping users.

After these changes, the data issues no longer occurred.
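One possible reading of this fix is that the 400 users are split into disjoint pools of 20, and each virtual user draws only from its own pool. The Python sketch below illustrates that reading. The pool layout, the user IDs 1 to 400, and the function names are assumptions, and the stress test tool's own user-selection mechanism is not shown.

```python
import random


def build_pools(user_ids, pool_size):
    """Split the user list into consecutive, non-overlapping pools."""
    return [user_ids[i:i + pool_size] for i in range(0, len(user_ids), pool_size)]


def pick_user(pools, virtual_user, rng):
    """Pick a random user from the pool that belongs to one virtual user."""
    return rng.choice(pools[virtual_user])


# 400 hypothetical user IDs split into 20 pools of 20 users each.
pools = build_pools(list(range(1, 401)), pool_size=20)
rng = random.Random()

# Each virtual user logs on as a random member of its own pool, so no
# single user accumulates a disproportionate number of orders.
for virtual_user in range(3):
    print(virtual_user, pick_user(pools, virtual_user, rng))
```

Because the pools never overlap, two virtual users cannot select the same user at the same time, and orders spread across all 400 users over a long run.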

In Step 6, we narrowed the scenarios down to the order shopping flow.

In Step 7, we kept the test running for 4 days and took two 10-hour snapshots, one on Day 2 and one on Day 4.

In Step 8, after comparing the two snapshots, we found no query with a growing cost per execution.

In Step 9, after comparing the two snapshots, we identified the top SQL queries that had growing rows read per execution, shown in red in Figure 6.


Figure 6. Top SQL with growing rows read per execution

Identifying these "growing" SQL queries gave us clues to the solution of this throughput degradation problem. Throughput improved after we made the following changes:

  • We dropped the extraneous index MEMBER_ID+TYPE+STOREENT_ID on the ORDERS table so that queries would use the right index, MEMBER_ID+STATUS+STOREENT_ID.
  • We created the CHECKED index for the BUSEVENT table.
  • We periodically cleaned the CTXMGMT and BUSEVENT tables during the test.

Figure 7 shows the execution status of the same SQL queries after we applied these modifications. Most of the growth in rows read per execution was eliminated.


Figure 7. Growing rows read per execution solved

After we fixed the queries with growing rows read per execution, the throughput appeared stable for the first 3 hours, as shown in Figure 8.


Figure 8. Throughput was stable in the first 3 hours

However, the degradation still existed in the long run as shown in Figure 9.


Figure 9. Throughput degraded in the long run

We then analyzed the access plan and decided to run runstats and rebind the database during a long-running test. Figure 10 and Figure 11 show how runstats helped to optimize the access plan. In this example, the overall access plan costs do not include the actual fetch time.


Figure 10. Access plan before runstats

Figure 11. Access plan after runstats

After we ran runstats and rebound the database in the middle of a long-running test, the overall throughput stabilized again and the throughput degradation problem was solved. Therefore, run runstats and rebind the database regularly so that the database statistics and access plans do not become out of date.
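To make this maintenance repeatable, the commands can be generated from a list of tables. The Python sketch below only builds command strings and does not run them. The database name MALL, the schema name WCSUSER, and the log file name are hypothetical placeholders, and the RUNSTATS options shown are one common choice rather than the exact options used in this test, so check them against the documentation for your DB2 version.

```python
def maintenance_commands(database, schema, tables, log_file="rbind.log"):
    """Build DB2 command lines that refresh statistics and rebind packages."""
    commands = [f"db2 connect to {database}"]
    for table in tables:
        # Collect distribution statistics and detailed index statistics.
        commands.append(
            f'db2 "RUNSTATS ON TABLE {schema}.{table} '
            'WITH DISTRIBUTION AND DETAILED INDEXES ALL"'
        )
    commands.append("db2 connect reset")
    # Rebind all packages so that they pick up the refreshed statistics.
    commands.append(f"db2rbind {database} -l {log_file} all")
    return commands


# The table names come from this example; the database and schema are placeholders.
for command in maintenance_commands("MALL", "WCSUSER",
                                    ["ORDERS", "BUSEVENT", "CTXMGMT"]):
    print(command)
```

Generating the commands instead of typing them makes it easier to run the same maintenance at fixed points in a long-running test and to compare runs afterward.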

Conclusion

This article summarized our experience analyzing and solving throughput degradation problems in WebSphere Commerce SVT. It described what to do when you encounter a throughput degradation problem and how to analyze DB2 snapshots using time costs and rows read per SQL execution. Throughput degradation is a complex problem that touches DB2 and SQL performance tuning, WebSphere Application Server performance diagnosis, program design, and coding. Solving it efficiently requires testers and developers to work together.



