IBM WebSphere performance tuning and IBM Tivoli Monitoring
Best practices for a POWER7 WebSphere Application Server 7 infrastructure
Increasing transactions per second (TPS) requires tuning for faster response times as well as lower CPU utilization for financial and web-based transactions. TPS directly affects user response times and the cost of your hardware infrastructure. This article shows how to tune IBM WebSphere Application Server 7 on IBM POWER7® hardware for better performance.
The hamburger service shown in Figure 1 describes the IBM POWER profile configuration of POWER7 LPARs for the allocation of resources. Each request controller LPAR has four physical processors (eight logical CPUs) with 8GB of RAM to process the incoming workload. Other services (for example, the onion service profile) are configured with two physical processors (four logical CPUs) and 8GB of RAM. Whether a service has more processors is determined by the load requirements of that specific service. In this case, the request controller service is called for each request, whereas the cheese and onion services are not.
The application layer refers to the internal layer, which includes a request controller service to process service-based transactions as well as a list of other WebSphere Application Server version 7-based Apache Tuscany (Open SCA) services. In this example, the application creates a hamburger. Based on customers' requests, specific services are executed (for example, hamburger grill service, lettuce service, onion service, packaging service, tomato service, and cheese service).
Figure 1. The hamburger service
Topologies for systems monitoring and management include IBM Tivoli Monitoring, where monitoring data is consolidated for performance views and management. The Tivoli topology includes IBM Tivoli Enterprise Monitoring Server as the primary hub for all systems monitoring, a backup Tivoli Enterprise Monitoring Server in case the primary fails, IBM Tivoli Enterprise Portal Server, and a number of remote servers to handle and load balance the systems monitoring agents.
Figure 2 shows the configuration of each instance within a POWER7 LPAR with one physical processor (two logical CPUs) and 4GB of RAM. This configuration covers infrastructures for up to 500 servers for the collection of systems monitoring data.
Figure 2. The Tivoli Monitoring version 6.2 architecture
Figure 3 shows the topology diagram for IBM Tivoli Composite Application Manager for Application Diagnostics version 7.1, which collects application performance data and provides diagnostics tools for performance tuning. The Tivoli Composite Application Manager services distributed across the LPARs are the visualization engine, kernel, publish server, global publish server, message dispatcher, and archive agent. The publish server provides scalability and availability for distributed application profiling.
Figure 3. Tivoli Composite Application Manager version 7.1 architecture
Performance requirements for transaction response times (that is, TPS) include the median response time (150ms), the average response time (170ms), the response time 95th percentile (180ms), and 20-35 percent CPU utilization on a single POWER7 core.
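The targets above can be checked mechanically against measured samples. The following is a minimal sketch, assuming response times have been exported from a load tool as a list of milliseconds. The sample values and the function names are hypothetical, and the nearest-rank percentile used here is only one of several accepted definitions.

```python
import math
import statistics

# Targets taken from the requirements above, in milliseconds.
TARGETS_MS = {"median": 150.0, "mean": 170.0, "p95": 180.0}


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100.0 * len(ordered))
    return ordered[max(rank, 1) - 1]


def summarize(samples_ms):
    """Return each measured statistic and whether it meets its target."""
    measured = {
        "median": statistics.median(samples_ms),
        "mean": statistics.mean(samples_ms),
        "p95": percentile(samples_ms, 95),
    }
    return {name: (value, value <= TARGETS_MS[name]) for name, value in measured.items()}


if __name__ == "__main__":
    # Hypothetical measurements, not data from the article.
    samples = [120, 135, 140, 145, 150, 155, 160, 165, 175, 400]
    for name, (value, ok) in summarize(samples).items():
        print(f"{name}: {value:.1f} ms -> {'meets' if ok else 'misses'} target")
```

A single slow outlier, such as the 400 ms sample here, is enough to push the 95th percentile past its target, which is why the requirements list the percentile separately from the median and the average.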
You implement vertical scaling by adding multiple Java™ Virtual Machine (JVM) instances on a single LPAR, all using the same processor and memory. You can use this architecture if your application is tuned to such a level of serialization that running multiple JVMs can increase your workload. Use this configuration if LPAR processors and memory are still available.
You implement horizontal scaling by adding multiple LPARs, each with one or many instances of WebSphere Application Server from the same service or application clusters. (This configuration is best used for processor- and memory-intensive services.) In Figure 1, the request controller invoked for each service request is an example of horizontal scaling, because controller patterns seem to be processor intensive and memory is determined by the payload of each service request.
Usage and performance management best practices
This section describes tools you can use to determine where performance bottlenecks in the transaction are located.
Tivoli Composite Application Manager provides many tools and graphs for monitoring your infrastructure:
- JVM CPU utilization graph: This Tivoli Composite Application Manager graph (see
Figure 4) provides utilization metrics of a
single JVM rather than simply system CPU utilization. In this case,
you see that for this application there is minimal JVM-specific
processor utilization—between 10 and 25 percent.
Figure 4. JVM CPU utilization (Percent, last hour)
- JVM memory utilization graph: This Tivoli Composite Application Manager graph (see
Figure 5) provides insights into the utilization
of a single JVM rather than simply system memory utilization. But in
this case, you can see that there is a 30 percent jump in memory use when
load is generated for the application.
Figure 5. JVM memory utilization (Percent, last hour)
- Application-level throughput: Throughput for a
request is available in Tivoli Composite Application Manager for availability
as well as problem determination (see Figure 6). Without
throughput and utilization side by side, performance bottlenecks
are difficult to correlate with transaction response times
and CPU or memory utilization. For this application, an average of
240 transactions are processed per minute, or about 4 TPS.
Figure 6. Throughput (request/min, last hour)
- Response times: Response times are critical performance
indicators both for your business and for your customers. In this case,
Tivoli Composite Application Manager shows initial transactions with extremely high response times in
the 12-second range (see Figure 7). Once all
resources have been loaded, response times improve to sub-second
levels—around a few hundred milliseconds.
Figure 7. Response time (seconds/min, last hour)
- Web container thread pool: This thread pool is initially
set to a minimum of 50 and a maximum of 50; in most cases, this
setting is sufficient. For this example, 10 concurrent requests were
sent; therefore, 12 web container threads were used. As
Figure 8 shows, there is a maximum of 50 threads,
and 24 percent are actually being consumed.
Figure 8. Thread pools
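The figures quoted in the list above reduce to simple arithmetic. As a small sketch (the function names are illustrative and not part of any IBM tool), the conversions from requests per minute to TPS and from active threads to pool utilization look like this:

```python
def requests_per_minute_to_tps(requests_per_minute):
    """Convert a per-minute throughput reading into transactions per second."""
    return requests_per_minute / 60.0


def pool_utilization_percent(active_threads, max_threads):
    """Share of a thread pool currently in use, as a percentage."""
    if max_threads <= 0:
        raise ValueError("max_threads must be positive")
    return 100.0 * active_threads / max_threads


if __name__ == "__main__":
    # Values from the example: 240 requests per minute, 12 of 50 threads busy.
    print(requests_per_minute_to_tps(240))   # 4.0
    print(pool_utilization_percent(12, 50))  # 24.0
```

Keeping both numbers in the same units makes it easier to compare a throughput graph with a thread pool graph captured over the same interval.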
By default in WebSphere Application Server, asynchronous web request dispatching is not enabled. Enable this setting to process asynchronous web requests by clicking AppServers > Server > Web container > Asynchronous Request Dispatching. Then, on the Configuration tab, select Allow Asynchronous Request Dispatching.
In clustered infrastructure profiles, it makes sense to monitor Distribution & Consistency Services (DCS) threads (see Figure 9). The DCS threads indicate network connections between each member of a cluster, including synchronization of configuration updates. For production configurations, IBM recommends disabling DCS in the WebSphere integrated console, because this feature is only required during configuration.
Figure 9. TCP DCS threads
- Transaction failure rate: The transaction failure rate (see
Figure 10) helps you quickly identify, while performance
tuning, whether transactions are failing. Such failures can occur if, for
instance, the database or other service that the transaction requires is
unavailable. In this small load performance example, zero transactions
are failing, which indicates that all the metrics gathered are valid for
capture in a tuning comparison.
Figure 10. Transaction failure rate
- Database connection pools: In Tivoli Composite Application Manager, the use of database
connection pools (see Figure 11) helps you determine
usage and whether thresholds for the database are employed. If the
transaction cannot access a data source or the request thread needs to
wait on database availability, this can directly affect response times.
Figure 11. Database connection pools
The agent that Tivoli Monitoring provides for system-level monitoring includes network activity to help determine, based on performance load tests, how much data is passing through the network (see Figure 12). In the case of this example, the aggregate packets per second is 50. This value can increase based on the payload (XML request and response) of the applications and services.
Figure 12. Network activity
Other metrics include CPU load averages, which can include idle time. In this case, totals over 15 minutes average 100 percent. Also provided are the user nice CPU, system CPU, and I/O wait percentages. In the capture interval for Tivoli Monitoring, you can see 100 percent idle time in Figure 13. Tivoli Monitoring system monitoring agents also provide trend analysis to help you make better decisions over time.
Figure 13. Tivoli Monitoring CPU utilization
nmon analyzer: In certain cases, CPU utilization for performance tuning may require more real-time data for tuning applications and services running on WebSphere Application Server. The nmon analyzer, an IBM freeware tool, provides this real-time data. In this example, the POWER7 LPAR has been configured for two physical processors (four logical CPUs), displayed in nmon as four processors (see Figure 14). Each logical CPU and its utilization is shown, and the total average is 50 percent. Keep in mind that the CPU utilization measurement is taken at time-based intervals and is never exact; therefore, be sure you use multiple tools and modes.
Figure 14. nmon CPU utilization
nmon CPU utilization also provides an l option to get the total average over a longer period of time when capturing CPU metrics. In this example, there is 90 percent CPU utilization with recurring drops to 0 percent. This result is based on the client load performance tool sending 10 concurrent requests and waiting on blocked threads. You can see this in Figure 15 as well as in a JavaCore file and in a hint from the web container threads monitored by Tivoli Composite Application Manager.
Figure 15. CPU utilization in nmon with the l option
top and prstat: On most UNIX® and Linux® systems, top and prstat are available to display utilization, including memory and CPU. In the example in Figure 16, the H option is set to display thread-level utilization. Each wasuser thread is a thread within WebSphere Application Server and, in some cases, is consuming 14 percent of the CPU. This consumption can be the result of several things: the actual service request processing a web container thread; DCS threads synchronizing with other members of a cluster; or other WebSphere Application Server-specific threads.
Figure 16. CPU utilization in top
vmstat: With vmstat, you can see utilization as well as other metrics to determine system-level bottlenecks caused by the application. The typical CPU columns us, sys, id, and wa also appear in the other tools mentioned earlier.
If you're working with vmstat on Red Hat Linux running on POWER7, you'll also get the steal (st) information in the far right column. This metric indicates CPU time that the hypervisor took from the LPAR's virtual processors for other work, rather than time spent on user processes. In the example shown in Figure 17, although performance load is running, you can see a high level of steal and only 18–21 percent user CPU utilization for WebSphere Application Server. This result may indicate that a two-physical-CPU LPAR configuration profile is too large for this application running on a single JVM because of context switching and processing of non-WebSphere Application Server-specific methods.
Figure 17. CPU utilization in vmstat
The processor data in the r column indicates the number of threads waiting to be processed. A high number indicates a thread bottleneck and many waiting threads. Because the load ran only briefly (shown in the first four rows), there were no waiting threads.
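Checking the r and steal columns by eye becomes tedious over a long capture. The sketch below, which assumes Linux-style vmstat output with a column header row, parses the text and flags samples where runnable threads exceed the logical CPU count or steal is high. The sample text is hypothetical and the 10 percent steal limit is an arbitrary assumption, not an IBM recommendation.

```python
# Hypothetical capture in Linux vmstat layout; not data from the article.
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 512000  20000 900000    0    0     0     4  300  500 20  5 40  0 35
 6  0      0 511000  20000 900000    0    0     0     0  350  900 21  6 38  0 35
 0  0      0 513000  20000 900000    0    0     0     0  100  200  1  1 98  0  0
"""


def parse_vmstat(text):
    """Parse vmstat output into a list of dicts keyed by column name."""
    rows, columns = [], None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "r":  # the column-name header line
            columns = fields
        elif columns and fields[0].isdigit() and len(fields) == len(columns):
            rows.append(dict(zip(columns, map(int, fields))))
    return rows


def flag_bottlenecks(rows, logical_cpus, steal_limit=10):
    """Return indexes of samples with more runnable threads than CPUs, or high steal."""
    flagged = []
    for index, row in enumerate(rows):
        if row["r"] > logical_cpus or row.get("st", 0) > steal_limit:
            flagged.append(index)
    return flagged


if __name__ == "__main__":
    samples = parse_vmstat(SAMPLE)
    print("flagged samples:", flag_bottlenecks(samples, logical_cpus=4))
```

Comparing r against the number of logical CPUs rather than a fixed number keeps the check meaningful when the LPAR profile changes.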
Generating load is critical to successful application tuning. You must establish a baseline with a load tool before each tuning change. Figure 18 shows JMeter being used to generate WebSphere Application Server load. To initiate load, JMeter requires:
- A web service URL
- An XML request payload
- The number of concurrent users
Figure 18. Load testing in JMeter
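To make the role of those three inputs concrete, here is a minimal sketch of a concurrent load driver. It is not how JMeter works internally; it only illustrates the idea, assuming a caller-supplied send_request function that posts the payload to the URL. The stand-in sender, the endpoint, and the payload shown are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(send_request, url, payload, concurrent_users, requests_per_user):
    """Call send_request from several threads and return each call's latency in seconds."""

    def user_session(_user_id):
        latencies = []
        for _ in range(requests_per_user):
            started = time.perf_counter()
            send_request(url, payload)
            latencies.append(time.perf_counter() - started)
        return latencies

    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        sessions = pool.map(user_session, range(concurrent_users))
        return [latency for session in sessions for latency in session]


def fake_send(url, payload):
    """Stand-in for a real HTTP POST; sleeps briefly instead of calling a service."""
    time.sleep(0.01)


if __name__ == "__main__":
    # Hypothetical endpoint and payload, used only to show the three inputs.
    results = run_load(fake_send, "http://localhost:9080/hamburger", "<order/>", 10, 5)
    print(len(results), "requests, slowest", max(results))
```

Running the same driver before and after a tuning change, with identical inputs, gives two latency lists that can be compared directly as a baseline and a result.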
For single requests where you need to inspect the response payload, Figure 19 shows soapUI. In some cases, you may want simply to confirm that the transaction is successful and validate the response payload. In this example, you can see high response times at the beginning of the application transactions, the result of loading initial classes, data sources, and caching. Before running high volumes of concurrent users, it may be helpful to initiate a few single-thread transactions and view the payload before running load against the newly tuned configuration.
Figure 19. Service testing in soapUI
Debugging problems in a development environment can be quick and efficient for a single application, but a multiple-service application can be challenging to troubleshoot. Tivoli Composite Application Manager for Application Diagnostics correlates Java 2 Platform, Enterprise Edition (J2EE) to J2EE and/or J2EE to WebSphere MQ transactions spanning multiple LPARs. This helps to identify the transaction flow and provide insights into the code to determine response times at the method level. As Figure 20 shows, the trace report of Tivoli Composite Application Manager for Application Diagnostics has identified methods in the application code with high response times. Methods 3, 7, 12, and 13 have been marked for double- and even triple-digit millisecond response times. Furthermore, you can drill down into each method to find an additional breakdown, if you need to know which section of the code requires a fix.
Figure 20. A transaction drill-down in Tivoli Composite Application Manager
Transaction execution paths are also available with Jinsight to identify the cause of high response times. Each method's elapsed CPU time is profiled, which quickly identifies the code responsible for execution problems (see Figure 21).
Figure 21. Xtrace method profiling
This figure shows the execution pattern for the
servlet and which methods are causing high response times. To capture a single
Jinsight transaction, begin by deploying Jinsight. To do so, add the
libjinsight-pLinux.so file to the WebSphere classpath:
JVM arguments:
-agentlib:jinsight-pLinux=localFileName=/tmp/trace.trc,localID=10
Start the Jinsight trace:
jinctl start 10
Stop the Jinsight trace:
jinctl stop 10
Once a method has been identified with high response times and before you submit a request for an application or service patch, run Xtrace for that method to qualify that it's actually the problem. The benefit of running Xtrace is that it enables profiling for a single method rather than the complete application or service. Therefore, you can run load against the service and determine more accurate total response times, including the response times of a single method that has been identified as a suspect for high response times.
To implement Xtrace in WebSphere Application Server 7, add the following line to the JVM arguments and restart the application server to include the change:
When you have set the methods in Xtrace, you will see in the native_err.out file
each method entry invocation marked with > and each method exit marked with <.
The method parameters and attributes are also marked, although this may not
provide as much value as the execution times of entry and exit.
The Java garbage collection policy affects the performance of your application and
by default is set to optthruput. This configuration
generates and keeps all objects in a single heap container and therefore has
performance-impacting garbage collection cycles. To improve performance in
most applications, use the generational concurrent (gencon) setting of the
IBM JDK for garbage collection:
-Xgcpolicy:gencon -Xmnx124M -Xmns124M -Xmos900M
Because the generational concurrent garbage collection policy splits the heap into two sections, it may be useful to specify the sizes of these sections of the heap. One section, the nursery, holds new objects, most of which leave the heap in the next scavenger or global garbage collection cycle. The other section, the tenured area, holds objects that leave only after global garbage collections.
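When several sizing options are combined on one line, it helps to confirm what each one sets. The following sketch parses the IBM JDK options shown above and reports the nursery and tenured sizes in bytes. The function name is illustrative, and -Xmox (tenured maximum) does not appear in this article; it is included only for completeness.

```python
import re

# IBM JDK heap-sizing options. -Xmox is not used in the article's example.
OPTION_NAMES = {
    "-Xmns": "nursery initial size",
    "-Xmnx": "nursery maximum size",
    "-Xmos": "tenured initial size",
    "-Xmox": "tenured maximum size",
}
UNIT_BYTES = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_heap_options(jvm_args):
    """Return {description: size in bytes} for the heap options found in jvm_args."""
    sizes = {}
    for token in jvm_args.split():
        match = re.fullmatch(r"(-Xm[no][sx])(\d+)([KMGkmg]?)", token)
        if match:
            flag, number, unit = match.groups()
            sizes[OPTION_NAMES[flag]] = int(number) * UNIT_BYTES.get(unit.upper(), 1)
    return sizes


if __name__ == "__main__":
    args = "-Xgcpolicy:gencon -Xmnx124M -Xmns124M -Xmos900M"
    for description, size in parse_heap_options(args).items():
        print(f"{description}: {size // 1024 ** 2} MB")
```

In the article's example, the nursery initial and maximum sizes are equal, so the nursery is fixed at 124 MB, while only the initial size of the tenured area is specified.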
The nursery is specified in the sample application, and you can see the scavenger
garbage collection responses in the native_err.out file. In line 1 of
Figure 22, the interval is 13072.241ms, which is exceptional and could possibly
be a concern for nursery garbage collection. In high-volume scenarios with
tighter intervals (for example, between 50 and 100ms), high CPU utilization for
garbage collections is a concern. Performance tuning the sizes of the nursery
and tenured heap sections is a must.
Figure 22. Output from a thread dump
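The reason tight intervals are a concern follows from a simple ratio of pause time to the interval between collections. As a rough sketch, assuming the interval is measured from the start of one collection to the start of the next, the share of time spent collecting can be estimated as follows. The 10 ms pause used in the example is a hypothetical value, not one taken from the figure.

```python
def gc_overhead_percent(interval_ms, pause_ms):
    """Approximate share of wall-clock time spent in garbage collection.

    Assumes interval_ms runs from the start of one collection to the start
    of the next, so each interval contains exactly one pause.
    """
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    return 100.0 * pause_ms / interval_ms


if __name__ == "__main__":
    hypothetical_pause_ms = 10.0  # assumed pause length, for illustration only
    for interval in (13072.241, 100.0, 50.0):
        share = gc_overhead_percent(interval, hypothetical_pause_ms)
        print(f"interval {interval} ms -> {share:.2f}% of time in collection")
```

With the same pause length, shrinking the interval from seconds to tens of milliseconds raises the share of time spent collecting by orders of magnitude, which is what makes nursery sizing worth tuning under high volume.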
JavaCore files indicate where threads are blocked or waiting and serve as the
initial point of performance investigation. You can generate this file from
Tivoli Composite Application Manager or, on Linux or UNIX, with
kill -3 <pid>. Using Tivoli Composite Application Manager as a centralized
performance management tool for JavaCore files helps, because no additional
remote connections or copies of files across the infrastructure are required.
In this article, you learned about tuning a WebSphere Application Server 7 and POWER7 deployment running Open SCA services. Based on the list of methods and tools provided here, you can develop a process for improving response times, CPU utilization, and hardware cost. Financial transactions and web-based requests will improve even more with added features and SOA services.