There are a lot of knobs and switches in the WebSphere Portal and Application Server software stack that you can play with when tuning the environment. Our Performance and Tuning Guide provides guidance and recommendations on what to pay attention to, but the fact is that which values work best for you depends heavily on the type of applications you run in Portal and the size and activity of your user population. The only way I know of to accurately arrive at the correct set of tuning values is to follow this basic testing methodology:
- Install and configure WebSphere Portal on a single system with some simple default content and perform the requisite tuning as prescribed by the Performance and Tuning Guide.
- Take a baseline performance measurement of this simple site, adjusting the infrastructure and software tuning until you obtain desirable results. Also get an idea of how many users the baseline system can sustain before it runs out of resources (more on this later).
- Start adding content and repeat the analysis, adjusting the tuning as necessary. With the baseline in hand, if things go wrong as you make changes, you have an idea of which change caused the problem.
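The ramp-up step in this methodology can be sketched in code. The sketch below is only an illustration of the loop, not a real load driver: `measure_avg_response` is a hypothetical placeholder that you would replace with calls to your actual load-testing tool, and the synthetic response-time curve merely stands in for real measurements.

```python
def find_sustainable_users(measure_avg_response, target_response_s,
                           step=25, max_users=10000):
    """Ramp up concurrent users until the average response time
    exceeds the target, then report the last sustainable level.

    measure_avg_response(users) -> average response time in seconds,
    supplied by your load-testing tool (a placeholder here).
    """
    sustainable = 0
    users = step
    while users <= max_users:
        avg = measure_avg_response(users)
        if avg > target_response_s:
            break  # target breached; the previous level was the limit
        sustainable = users
        users += step
    return sustainable

# Synthetic stand-in for real measurements: response time grows
# slowly, then spikes sharply somewhere past ~400 users.
def synthetic_measure(users):
    return 1.0 + (users / 400.0) ** 8  # seconds

print(find_sustainable_users(synthetic_measure, target_response_s=5.0))
```

Rerunning this loop after each content change, and comparing against the baseline number, is what tells you which change moved the needle.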
Performance and capacity testing is a long, highly iterative process. It is also resource intensive: it requires dedicated systems for days on end, plus enough people to manage the test environment, observe test results, and tune the system. Too often, this important process is the first thing cut from a deployment plan in jeopardy. In my experience, you either pay for this time up front, before you go live, or you pay for it after you go live. It must be done, and it is a lot more expensive the later you do it.
But all this being said, what should your goal be? Obviously, you go into this process with certain metrics in mind: for instance, "I want to handle 400 concurrent users with no worse than a 5-second response time," perhaps only during login. That's fine, and simple to measure, but there will be days when, for some reason, you have far more than 400 concurrent users, or you have to take systems down and the remaining systems must absorb the load. It isn't enough to know whether your environment can handle the typical load; you need to know whether it can handle the atypical, or worst-case, scenario as well. You may not know what the worst-case scenario is up front. What you do need to know is the maximum capacity of your portal environment, so that as usage approaches that number, you will know you are in trouble and need to add capacity.
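Once you have a measured maximum, the "approaching that number" check is simple arithmetic. The figures and the 80% warning threshold below are illustrative assumptions, not product settings:

```python
def headroom(current_users, max_capacity_users, warn_fraction=0.8):
    """Return the fraction of measured maximum capacity in use,
    and whether usage has crossed the warning threshold (the 80%
    default is an assumed rule of thumb)."""
    used = current_users / max_capacity_users
    return used, used >= warn_fraction

# Suppose capacity testing showed the environment tops out around
# 1,000 concurrent users, and a typical day sees 400.
used, warn = headroom(400, 1000)
print(f"{used:.0%} of capacity, warning={warn}")  # 40% of capacity, warning=False

# A spike to 850 concurrent users is your cue to add capacity
# before you actually hit the wall.
used, warn = headroom(850, 1000)
print(f"{used:.0%} of capacity, warning={warn}")  # 85% of capacity, warning=True
```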
To understand your environment's capacity, you need to drive utilization of your portal environment to the point of CPU exhaustion. Not memory exhaustion or DB connection exhaustion - CPU exhaustion. The reason is that as CPU utilization approaches 100%, the system slows to the point of being nearly unresponsive, but it doesn't fail. Under this condition, the Web servers managing the load across a portal cluster will mark such a system as down and route traffic elsewhere, giving the server a chance to recover, which it should once the outstanding requests have been processed. If you run out of some other resource, like memory or DB connections, before running out of CPU, then requests really start to fail, and the system won't recover from that.
So, as you perform your maximum-capacity tests, give the server instances a Java heap size and request and DB connection pool sizes large enough to allow traffic to drive CPU to 100%. If you can't reach 100% CPU before running out of some other resource, then it is time to scale vertically (creating vertical cluster members on the same physical system) or reallocate processors to other systems. If you can configure the system to meet this goal, then you have found your "sweet spot".
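A rough sanity check for sizing those pools comes from Little's Law: the number of in-flight requests is approximately throughput times response time. The numbers and the 50% safety margin below are assumptions for illustration, not recommended settings:

```python
import math

def min_pool_size(throughput_rps, avg_response_s, safety_factor=1.5):
    """Little's Law: in-flight requests ~= throughput x response time.
    The request and DB connection pools must hold at least that many
    entries; the safety factor (an assumed 50% margin) covers bursts
    while you push CPU toward saturation."""
    return math.ceil(throughput_rps * avg_response_s * safety_factor)

# E.g., to sustain 100 requests/second at a 2-second average response
# time during the saturation test, size the pools for roughly:
print(min_pool_size(100, 2.0))  # 300
```

If a pool sized this way still empties before CPU saturates, that is the signal from the paragraph above that something other than CPU is the bottleneck.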