WAS Capacity Planning Story
In production, you need to make sure that your WebSphere Application Server clusters have enough capacity.
How much capacity is enough?
Part of that answer depends heavily on your specific applications and the load they create at predicted levels of usage. But there's also the question of how much capacity you need to run that predicted load, whatever it may be.
This is a real customer story. I'm probably not supposed to say who it is, but they're a big company you've definitely heard of and perhaps do business with. Their business runs on IT, so it's very important to them that they have adequate capacity; if they run out of capacity, their customers don't get service and the company loses money.
Adequate capacity doesn't mean running resources (the host servers' CPU, memory, I/O bandwidth, etc.) at 100%. Resource usage fluctuates even while load remains constant, and 100% leaves no room to fluctuate upward, even briefly, much less room to accept additional load from additional users or from the same users performing more resource-intensive tasks. One rule of thumb is that full capacity for a server is about 80%; that gives it room to fluctuate and for the OS to perform overhead tasks. Now, 80% is not an ideal; it's a maximum, like filling a glass "full" means just below the rim so it won't spill.
This customer runs its servers at about 10% capacity. 10%?! What, do they just like buying hardware for no reason? No, they're very cautious, and understandably so. They're not planning for the best case scenario--that's 80%; they're planning for the worst case scenario, which is rare but can happen. And that worst case scenario means that under best case conditions, they run their servers at 10% capacity.
So what is this worst case scenario they plan for? First, they want to be able to take as much as half of their server capacity offline at one time to perform maintenance like OS patches and WAS upgrades. That means their servers normally need to run at 40-50% capacity so that they'll run at 80-100% during major planned maintenance. Next, they want to be prepared for as much as half of their capacity to crash in a massive outage. (If they suffer a 100% outage, they've got bigger problems.) The outage could occur during maintenance, which would cut available capacity to 25% of normal, so normal usage has to come down to 20-25%. Furthermore, they want to be prepared for a usage spike that as much as doubles traffic. If traffic doubles during an outage and maintenance, normal usage has to come down to 10-12.5%. Thus under normal circumstances, they run at 10% capacity, so that even if their capacity gets halved twice and their load doubles, they're still running at 80% of available capacity and can adequately meet user demand.
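If you want to run the same arithmetic with your own numbers, here's a minimal sketch in Python. The function names and parameters are my own invention for illustration, not anything from WAS or from this customer's tooling. It assumes load spreads evenly across whatever servers are still up.

```python
def normal_utilization_target(max_utilization, capacity_fractions_remaining, load_multiplier):
    """Utilization to run at normally so the worst case stays at max_utilization.

    max_utilization: ceiling you accept in the worst case (0.80 for the rule of thumb).
    capacity_fractions_remaining: fraction of capacity left after each event,
        e.g. [0.5, 0.5] for maintenance on half the servers plus an outage
        that takes out half of what is left.
    load_multiplier: how much traffic could grow in a spike (2.0 = doubles).
    """
    remaining = 1.0
    for fraction in capacity_fractions_remaining:
        remaining *= fraction  # each event shrinks what is still available
    return max_utilization * remaining / load_multiplier


def worst_case_utilization(normal_utilization, capacity_fractions_remaining, load_multiplier):
    """Utilization of the surviving servers if every planned-for event happens at once."""
    remaining = 1.0
    for fraction in capacity_fractions_remaining:
        remaining *= fraction
    return normal_utilization * load_multiplier / remaining


# The customer's scenario: 80% ceiling, capacity halved twice, load doubled.
target = normal_utilization_target(0.80, [0.5, 0.5], 2.0)
print(f"Run normally at about {target:.1%}")
```

With a 100% ceiling instead of 80%, the same call gives 12.5%, which is where the upper end of the 10-12.5% range comes from.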
Moral of the story: Don't run your servers at 100%; run them at more like 80%, and that's if you plan on all servers always being online and load being constant. Since neither of those assumptions is realistic, back off your capacity maximum accordingly. Your IT may not be as important to your customers and business as it is for this WAS installation, but you still need to ask yourself "What if we lost a lot of capacity? How much is too much?" and then plan accordingly.