Modern server systems have advanced significantly from the single-socket or dual-socket systems of years past. Today, most servers used in commercial applications have one or more multicore processors. Hardware multithreading, such as hyperthreading or simultaneous multithreading, makes the environment more complex by adding more hardware threads. Although hardware multithreading has been in use for several years, now systems are available that take this ability to a new level, combining many cores with high levels of hardware multithreading to give the ability to run hundreds of simultaneous computing threads. An example of this is the Sun CoolThreads server, using the UltraSPARC T1, T2, and T2 Plus processors to support up to 256 simultaneous computing threads.
This development brings up an important question: When it comes to systems with hundreds of simultaneous computing threads, can the software fully use this hardware? We examined this question using WebSphere Portal V6.1 running on a Sun T5240 system with 16 processor cores, supporting 128 simultaneous computing threads. We tried several approaches in this environment before finding one that gave good utilization of the system. This article discusses the approaches tried, the results of measurements on those configurations, plus the tuning and configuration used to get our results.
The measurement scenario models a portal in a corporate intranet environment. In such an environment, we expect that users are assigned to groups, perhaps by job function or based on organizational boundaries. Some of the content in the portal, for example, places, pages, and applications, is available only to specific groups, while other content is available to all authenticated users. Only a small number of pages are available to unauthenticated users. In this scenario, most users log in to the site.
Content in this scenario is provided by some simple portlets created specifically for this scenario. The set of portlets developed for these measurements uses the API defined in the Java™ Portlet Specification 1.0. None of the portlets uses any database, network content source, or other external system; instead, these portlets generate all of their content from hard-coded inputs.
User interactions are simulated through a virtual user script. Each iteration of this script simulates a single user interacting with the site; this interaction consists of multiple page views. A virtual user runs many iterations of the script during a measurement. Each page view in the script is considered a single transaction. We measure response times for each transaction and the total transaction rate.
The goal of the measurement is to find the highest capacity for the system, which is defined as the highest transaction rate at which response times still meet our criteria.
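The capacity definition above can be read as a small search over measured load levels. The following is a minimal sketch, not the tooling used for these measurements. It assumes each input line holds a transaction rate and the response time observed at that rate, and it simplifies the response-time criteria to a single limit. The function name capacity, the input layout, and the limit are hypothetical, and the numbers in the usage line are placeholders, not measured results.

```shell
#!/bin/sh
# Hypothetical helper: report the highest transaction rate whose response
# time is still within the limit (0 if no load level qualifies).
# Usage: capacity LIMIT < file, where each line is "rate response_time".
capacity() {
    awk -v limit="$1" '
        BEGIN { best = 0 }
        # Keep the largest rate among the load levels that meet the limit.
        $2 <= limit && $1 > best { best = $1 }
        END { print best + 0 }
    '
}

# Placeholder data (NOT measured results): rate in transactions per second,
# response time in seconds, checked against a hypothetical 2-second limit.
printf '100 0.8\n200 1.2\n300 1.9\n400 3.5\n' | capacity 2
```

Because the limit applies to response time rather than processor load, the rate reported this way can be reached well before the processors are fully busy.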
In our measurement environment, we use a separate database server, directory server, HTTP server, and Deployment Manager server. All of these servers are in addition to the main system under test, the application server. These servers are shown in figure 1.
Figure 1. WebSphere Portal V6.1 on T5240 measurement environment diagram
We tried three configurations in our lab environment. In addition, there's a fourth configuration that might make sense in this environment:
- Standalone application server. The simplest deployment is to deploy WebSphere Portal in a single Java virtual machine (JVM) on the portal server node.
- Vertical cluster environment. A vertical cluster uses the clustering capabilities in IBM WebSphere Application Server to deploy multiple Java virtual machines, all running WebSphere Portal, on a single system.
- Vertical cluster environment with processor sets. This configuration takes the vertical cluster configuration from the previous item and binds each JVM process to a subset of the available compute threads.
- Solaris virtualization using zones to provide a horizontal cluster. We did not include this configuration in our lab measurements, but it provides a way to further separate each application server JVM from the others running on the same physical server.
Measurement results and observations
In these measurements, we calculated the relative capacity of the three configurations. We took the lowest-capacity configuration, the standalone application server, as our baseline point, and then we compared the capacity of the other two configurations to that baseline. This comparison, shown in figure 2, makes it easy to see the improvement in capacity given by those configurations.
Figure 2. WebSphere Portal V6.1 on T5240 relative throughput
We define capacity as the highest load level at which all transactions give acceptable response times. Thus, the system often reaches capacity before the processor is fully utilized. We observed the processor loads on the portal system in the measurements shown in figure 3.
Figure 3. WebSphere Portal V6.1 on T5240 processor utilization
We had the following observations on the target configurations:
- Standalone application server. Unfortunately, a single JVM was not able to fully use the large number of compute threads, so this configuration gave the lowest capacity. As figure 3 shows, the system was only 21 percent busy when we reached maximum capacity. In other words, on average, the equivalent of more than 12 of the 16 processor cores was idle.
The standalone application server is limited for several reasons:
- The application server uses only 50 Web container threads to process all the incoming requests, so no more than 50 of the 128 computing threads can be active processing client requests. This limitation prevents the standalone application server from completely utilizing the processing resources in this system. The number of Web container threads could be increased, but doing so would exacerbate the next problem.
- With a high number of Web container threads busy, any lock contention within the application server becomes a performance bottleneck, increasing response time and reducing capacity.
- Vertical cluster environment. With several JVMs running concurrently, a vertical cluster environment is much better situated to fully use the system. Across the multiple JVMs, there are more Web container threads available, so more of the computing resources are available. In addition, there are no locking issues across the different JVMs, so lock contention becomes less of an issue; each lock has a smaller scope. Together, these factors gave a significant increase in capacity.
- Vertical cluster environment with processor sets. By binding each JVM to a subset of the compute threads, we limited the amount of concurrent traffic that each can process. While this limitation might seem to cause a performance reduction, it gives the opposite result: Each JVM is able to process its work more efficiently, giving higher overall capacity. The result was higher capacity with lower processor utilization.
- Solaris virtualization using zones to provide a horizontal cluster. Although we did not measure this configuration, we believe it would be a good fit for this application. The Solaris zones technology is used to virtualize operating system services and provide an isolated and secure environment for running applications. A zone is a virtualized operating system environment within one instance of the Solaris operating system. Each zone is isolated from the others, so that processes in one zone cannot monitor or affect processes in other zones, even if they have superuser credentials. Thus, a Solaris zone is a lightweight and secure virtualized environment, without a penalty in performance. These properties make a Solaris zone an ideal virtualization technology for a horizontal cluster. Detailed steps to configure, install, and deploy zones with Solaris can be found in the System Administration Guide: Solaris Containers-Resource Management and Solaris Zones at http://docs.sun.com.
Using Solaris processor sets
Solaris resource pools allow us to group a number of processors into a pool and bind a Java process to that pool. We used processor sets to partition the virtual processors (compute threads) among the JVMs.
Through our experiments, we found that the most efficient usage for our scenario was to bind 21 compute threads to one JVM. We created a vertical cluster with six WebSphere Portal members, and then we bound each member to a Solaris processor set. The "right" number of compute threads (and therefore WebSphere Portal members) can vary with the application, but we expect that four to six members will give good performance for most WebSphere Portal applications.
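The arithmetic behind the choice of 21 compute threads can be sketched as follows. This is an illustration only, assuming the 128 compute threads are divided evenly among the cluster members. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch: how many compute threads each cluster member gets when the
# threads are divided evenly. 128 is the thread count of the system
# described in this article.
TOTAL_THREADS=128

threads_per_member() {
    # $1 = total compute threads, $2 = number of cluster members (JVMs).
    # Integer division: every member gets the same whole number of threads.
    echo $(( $1 / $2 ))
}

unassigned_threads() {
    # Threads left over after the even split.
    echo $(( $1 % $2 ))
}

# Compare the four-to-six member range suggested above.
for members in 4 5 6; do
    echo "$members members: $(threads_per_member "$TOTAL_THREADS" "$members") threads each," \
         "$(unassigned_threads "$TOTAL_THREADS" "$members") left over"
done
```

With six members, the even split gives 21 threads per member, which matches the binding described above, and two threads remain outside the six sets.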
Use the following Solaris commands to set up the configuration:
1. Run pooladm -e to enable the pool facility.
2. Run pooladm -s to create a static configuration file that matches the current dynamic configuration.
3. Run poolcfg -c 'create pset wp_pset1 (uint pset.min=20; uint pset.max=21)' to create a processor set named wp_pset1 with a minimum of 20 and a maximum of 21 processors. Create one processor set for each application server JVM, each with its own name.
4. Run poolcfg -c 'create pool wp_pool1' to create a resource pool named wp_pool1. Create one resource pool for each processor set created in the previous step, with a unique name for each pool.
5. Run poolcfg -c 'associate pool wp_pool1 (pset wp_pset1)' to associate a resource pool with a processor set. Use this command for each processor set/resource pool pair. The names used above for processor sets and resource pools are arbitrary; the requirements are that the processor set name must be matched up with the resource pool name in step 5, and the resource pool name must be used when binding the JVM to the resource pool in step 9.
6. Run pooladm -c to commit the configuration at /etc/pooladm.conf.
7. Start the WebSphere Portal cluster members from the WebSphere Application Server administration console.
8. Find the process IDs for the WebSphere Portal JVMs with the command ps -ef | grep java. There is a separate process ID for each JVM.
9. Run poolbind -p wp_pool1 PortalPID, where PortalPID is the process ID of a portal JVM. This command binds the selected process to a resource pool. Repeat this command for each of the portal JVMs, using a separate resource pool for each JVM process.
If you restart WebSphere Portal, you need to run the poolbind command again with the new WebSphere Portal process ID. Steps 1-6, though, do not need to be repeated.
More information on these commands can be found in the Solaris documentation set, System Administration Guide: Solaris Containers-Resource Management and Solaris Zones, at http://docs.sun.com.
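Because the processor set and resource pool commands are repeated once per cluster member, it can help to generate them. The following is a minimal sketch under stated assumptions, not a script from the original measurements. It only prints the commands so they can be reviewed before a privileged user runs them. The function names are hypothetical, and the wp_psetN and wp_poolN names simply extend the naming pattern used in the steps above. Note that the property type keyword in poolcfg syntax is uint.

```shell
#!/bin/sh
# Print (do not execute) the pool setup commands for a number of members.
# $1 = number of cluster members, $2 = pset.min, $3 = pset.max
print_pool_setup() {
    members=$1
    pset_min=$2
    pset_max=$3
    printf '%s\n' "pooladm -e"    # enable the pool facility
    printf '%s\n' "pooladm -s"    # save the current dynamic configuration
    i=1
    while [ "$i" -le "$members" ]; do
        # One processor set and one pool per JVM, associated pairwise.
        printf '%s\n' "poolcfg -c 'create pset wp_pset$i (uint pset.min=$pset_min; uint pset.max=$pset_max)'"
        printf '%s\n' "poolcfg -c 'create pool wp_pool$i'"
        printf '%s\n' "poolcfg -c 'associate pool wp_pool$i (pset wp_pset$i)'"
        i=$((i + 1))
    done
    printf '%s\n' "pooladm -c"    # commit the configuration
}

# Print one poolbind command per process ID, pairing the first ID with
# wp_pool1, the second with wp_pool2, and so on.
print_poolbind() {
    i=1
    for pid in "$@"; do
        printf '%s\n' "poolbind -p wp_pool$i $pid"
        i=$((i + 1))
    done
}

# Six members with 20 to 21 processors each, as in this article.
print_pool_setup 6 20 21
```

After the cluster members are started, a call such as print_poolbind followed by the portal process IDs (taken from the ps output) prints the binding commands, which need to be regenerated whenever WebSphere Portal is restarted.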
WebSphere Portal server tuning
We installed WebSphere Portal with the default 64-bit JVM on Solaris running a 64-bit kernel. The large address space of a 64-bit JVM allows for a large heap size; we found that a 3.5 GB heap gave good performance for this application. Other applications can optimize their performance with different heap sizes.
Full tuning details for this measurement, covering the application server, Solaris kernel, network, and other components, are given in the IBM WebSphere Portal Version 6.1.x Tuning Guide.
Table 1 lists the key tuning settings specific to this environment.
Table 1. Tuning settings
| JVM argument | Description |
|---|---|
| -Xmx3584M | This argument sets the maximum Java heap size to 3.5 GB. |
| -Xmn768M | The new area in the heap was set to 768 MB. |
| -server | This argument selects the HotSpot server VM, which typically provides the best performance for server-based applications such as WebSphere Portal. |
| -XX:MaxPermSize=768M | With a 64-bit JVM, a large permanent region is needed in the heap; this argument allows the permanent region to grow up to 768 MB. |
| -XX:+UseConcMarkSweepGC | Use the concurrent mark-sweep collector for the tenured generation. The application is paused for short periods during the collection. We found that this collector gave the most stable response times. |
| -XX:+UseParNewGC | By default, the concurrent low pause collector uses a single-threaded young generation copying collector. This argument enables parallel collection of the new area of the heap, providing better utilization of the available compute threads. |
| -XX:SurvivorRatio=6 | The survivor ratio tunes the distribution of spaces within the new area of the heap. This argument helped make more efficient use of that new area. |
| -XX:ParallelGCThreads=5 | On a chip multithreading processor-based system, the number of garbage collection (GC) threads should be no higher than one quarter of the total compute threads. Our system has 128 compute threads, so we should use at most 32 (128/4) GC threads. We distributed these 32 GC threads across 6 JVMs, giving 5 (32/6, rounded down) GC threads per JVM. |
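The GC thread calculation in Table 1 can be adapted to other thread counts and cluster sizes. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the floor of one GC thread per JVM is an added safeguard that is not part of the original guidance. The remaining arguments are copied from Table 1.

```shell
#!/bin/sh
# Sketch of the GC thread rule from Table 1: budget one quarter of the
# compute threads for GC, then divide that budget across the JVMs.
gc_threads_per_jvm() {
    # $1 = total compute threads, $2 = number of JVMs on the system.
    budget=$(( $1 / 4 ))
    per_jvm=$(( $budget / $2 ))     # integer division rounds down
    if [ "$per_jvm" -lt 1 ]; then
        per_jvm=1                   # assumption: never go below one GC thread
    fi
    echo "$per_jvm"
}

jvm_args() {
    # $1 = ParallelGCThreads value; the other settings are copied from Table 1.
    printf '%s %s %s\n' \
        "-server -Xmx3584M -Xmn768M -XX:MaxPermSize=768M" \
        "-XX:+UseConcMarkSweepGC -XX:+UseParNewGC" \
        "-XX:SurvivorRatio=6 -XX:ParallelGCThreads=$1"
}

# 128 compute threads shared by 6 JVMs, as in this article.
jvm_args "$(gc_threads_per_jvm 128 6)"
```

For the system in this article, the calculation yields 5 GC threads per JVM, so the printed argument string ends with -XX:ParallelGCThreads=5.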
Any environment with a large number of compute threads can encounter performance problems if there is a significant amount of locking within the application. This problem is exacerbated in an environment such as the Sun T5240. Because each compute thread is a relatively slow virtual processor, threads on this type of processor architecture often hold locks longer than they would on more conventional processor architectures.
Significant effort went into WebSphere Portal V6.1 to reduce locking within the WebSphere Portal framework, reducing performance problems due to locking. However, WebSphere Portal is an application framework that can be used to run a large variety of applications. If the applications (portlets) themselves contain significant amounts of locking, then locking can still become a performance bottleneck.
WebSphere Portal can run on a variety of systems with widely different architectures. The results shown here demonstrate that it can achieve good capacity on a system with a large number of compute threads. This article also explains the tuning and configuration needed to achieve this capacity.
- Participate in the discussion forum.
- Read more about WebSphere Portal in the WebSphere Portal product wiki.
- Refer to the WebSphere Portal product page on developerWorks.
- Refer to the IBM WebSphere Portal V6.1 Tuning Guide.