WebSphere Performance - Alexandre Polozoff's Point of View
Performance testing requires a tool, and for one of my projects I'm using JMeter. I was testing an https-based site and having a hard time figuring out what was going on. I kept seeing an error in the JMeter "view results tree" that just said "ensure browser is set to accept the jmeter proxy certificate". I started researching that phrase and got nowhere quickly.
However, in the jmeter/bin subdirectory I found the jmeter.log file, and in it a java.security.NoSuchAlgorithmException referencing the SunX509 KeyManagerFactory. Ah ha, yes: I'm running the IBM JRE and not the Sun version.
Unfortunately, setting proxy.cert.factory=IbmX509 in jmeter.properties (and of course uncommenting it) had no effect; I got the same SunX509 exception. I decided to try it at the command line instead:

    jmeterw -Dproxy.cert.factory=IbmX509

and voila, the problem went away!
For as long as I can remember, the most debated Java topic has been whether the minimum heap size should equal the maximum, with lots of urban myths and legends holding that having them equal is better. In a conversation with a number of colleagues, Chris Bailey, who has led the Java platform work for many years, clarified the settings for the IBM JVM based on generational vs. non-generational policy settings.
"The guidance [for the generational garbage collection policy] is that you should fix the nursery size: -Xmns == -Xmnx, and allow the tenured heap to vary: -Xmos != -Xmox. For non-generational you only have a tenured heap, so -Xms != -Xmx applies."
Here is a link to Chris Bailey's presentation on generational garbage collection: http://www.slideshare.net/cnbailey/tuning-ibms-generational-gc-14062096
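Expressed as IBM JVM command-line options, that guidance looks like the following sketch. The sizes are purely illustrative placeholders, not recommendations:

    # Generational (gencon) policy: pin the nursery, let the tenured area vary
    -Xgcpolicy:gencon -Xmns256m -Xmnx256m -Xmos512m -Xmox2g

    # Non-generational (e.g. optthruput) policy: a single tenured heap, min != max
    -Xgcpolicy:optthruput -Xms512m -Xmx2g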
Someone takes a javacore during what looks to be a hung application server and notices it contains lots of threads in socketRead. This is symptomatic of a slow backend, whether it is a database, a Web service, etc. An application is only as strong as its weakest link: if the backend the application depends on cannot respond in a timely manner, there is nothing to tune at the application layer except aggressive timeouts to protect the application from getting stuck. Hangs like these typically happen under high load/traffic conditions. It is important to make the group that maintains the backend aware of the issue with their tier so they can fix it.
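In practice those timeouts live in the connection pool or client library configuration, but as a minimal sketch of the underlying mechanism (the host, port, and values here are hypothetical):

    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class AggressiveTimeouts {
        public static void main(String[] args) throws Exception {
            Socket s = new Socket();
            // Fail fast rather than block forever on a slow backend:
            s.connect(new InetSocketAddress("backend.example.com", 50000), 5000); // 5s connect timeout
            s.setSoTimeout(10000); // the socketRead seen in the javacore now aborts after 10s
            s.close();
        }
    }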
This is the page to follow if there seem to be any Maximo performance or stability problems.
Maximo v7.5 brings report scheduler enhancements. As with any online transaction application, most enterprises need to pull reports from their environment. Reports tend to be (a) scheduled to repeat and (b) heavy users of CPU and memory. Therefore having more control over the report scheduler is a good thing to look at in Maximo v7.5.
I am often asked which specific metrics should be monitored in WebSphere Application Server. The list below can be considered a starting point. The WAS administrator will frequently use the "Custom" PMI setting and disable any metrics not in this list.
Thread Pools
Suggested: Pool Size, Active Thread Count
Optional: ActiveTime (avg. time in use), ConcurrentHungThreadCount
For WESB, per mediation: Mediation.ThreadCount

Connection Pools
Database - JDBC Data Source Connection Pools
Messaging - JMS Queue Connection Factory Connection Pools
Suggested: PoolSize, WaitingThreadCount, WaitTime

Messaging Engine (SIB)
Suggested: BufferedReadBytesCount, BufferedWriteBytesCount
Suggested: AvailableMessageCount, LocalMessageWaitTime

Transactions (per application server)

Memory Management (Java GC)
Suggested: Heap Size, % Free after GC, % Time spent in GC
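As a hedged wsadmin (Jython) sketch of switching a server onto the custom statistic set (the server name is a placeholder, and the Perf MBean operation names are as I recall them from the WAS documentation, so verify them against your version):

    # Point PMI at the "custom" statistic set for one server
    perf = AdminControl.completeObjectName('type=Perf,process=server1,*')
    AdminControl.invoke(perf, 'setStatisticSet', 'custom')
    # The individual counters listed above are then enabled under the Custom
    # monitoring level in the admin console (or via the Perf MBean's
    # setCustomSetString operation).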
Yesterday at my "Performance Testing and Analysis" talk the question was asked: is it better to have one large machine and virtualize the environment, or to have lots of small machines?
Both strategies work. If you decide to go with large-capacity frames, then you want to have at least 3 of them. That way, if one frame is taken out of service, there are at least two other frames running. If one builds out only 2 large frames and one is taken out of service, the remaining frame becomes a single point of failure; I don't like SPOFs, so I would have at least 3. This assumes one frame can carry the entire production load during the outage, which has to be verified through performance testing. If it cannot, then more frames will likely be needed to ensure that even if half the infrastructure is taken out of service, the remaining frames can carry the entire production workload. The sketch below makes the arithmetic concrete.
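A minimal sketch, with purely hypothetical numbers:

    public class FrameSizing {
        // Load each remaining frame must carry if 'failures' of 'frames' are lost.
        static double perFrameLoad(double peakLoad, int frames, int failures) {
            return peakLoad / (frames - failures);
        }
        public static void main(String[] args) {
            // 12,000 req/s peak across 3 frames, surviving 1 frame outage:
            System.out.println(perFrameLoad(12000, 3, 1)); // each frame needs 6,000 req/s of capacity
        }
    }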
Likewise, having lots of smaller machines also works. The odds of a massive hardware outage are lower with smaller machines, so as one fails it can safely be taken out of service and replaced without impacting the production workload.
Which strategy is better? In my opinion both are valid as long as the proper capacity planning is conducted to ensure that when an outage occurs (and think worst-case scenario here) the remaining infrastructure is able to continue processing the production workload without violating the SLA (Service Level Agreement, including the non-functional requirements, i.e. response time, resource utilization, etc.). You do have a defined SLA, right?
A few years ago I blogged about how adding JSESSIONID logging to the access log helps identify which cluster member a user was pinned to. It turns out this also helps troubleshoot another interesting problem.
A WebSphere Application Server administrator noted that one JVM in their cluster was seeing far higher session counts than any other JVM, to the point of a 3:1 imbalance in the total number of sessions. We applied the JSESSIONID logging and captured all the session IDs. Through various Unix utilities (cut, sort, uniq, etc.), along the lines of the pipeline below, we ended up with a prime suspect: one session was calling the /login page 10-20 times per second and had eclipsed every other session by over 10x the number of requests.
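The exact commands depend on your access log format; this is a sketch assuming the JSESSIONID was logged as, say, the 12th space-separated field:

    # count requests per session id and list the heaviest sessions first
    cut -d' ' -f12 access.log | sort | uniq -c | sort -rn | head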
Why did we go down this path? We were able to see through the PMI data that the session manager in WebSphere Application Server was invalidating sessions, so we knew it wasn't a product issue of sessions not being deleted. Also, one JVM in a cluster creating more sessions than the other JVMs is suspicious; I would have expected to see the higher load spread across the cluster. In addition, they had seen the behaviour move around the cluster every few months. That led me to believe this was like a replay attack: someone at some point captured a response with a JSESSIONID and was then using that JSESSIONID over and over again until some event caused it to capture a new JSESSIONID (most likely a failover event as the cluster went through a rolling restart). That behaviour was curious! The fact it was smart enough to realize the HTTP header content had changed, and adapted, was interesting.
So next time you see one or more JVMs with considerably higher session counts than the other JVMs in the same cluster, you can use the same troubleshooting methodology to track down the suspect. Especially if your application is Internet-facing, meaning anyone can start pinging it.
I have had a number of conversations with various colleagues lately around "optimal performance." It kind of reminds me of an old post I had about predicting capacity and application performance for a non-existent application (hint: it can't be done).
One scenario we were discussing was an application that was currently in one of its testing cycles. The question came up: could we have another team test the same application and see if they could get better performance, or at least validate that the performance we are seeing is consistent? That seemed an unusual request. It turned out, after some questioning, that one team was not confident in the people they had or the results they were seeing. My answer was that they should pull in the right resources to help. Having a second performance team test the same application made little sense; just standing up a second test environment would have been a big effort, not to mention getting floor space in the data center.
So how do you determine if the performance your application sees is optimal performance?
The answer is twofold. First, as in my previous post on when enough performance tuning is really enough, it is driven by business requirements (SLAs) and by having the appropriate application monitoring tools in place to verify the application is meeting those SLAs.
The second is a little greyer. While SLAs should hold a lot of the requirements, one that is commonly not in scope is capacity, which is difficult to predict until performance testing has been completed. The application may meet its response time and transaction throughput requirements, yet we may discover, especially in high-volume environments, that the hardware infrastructure capacity needed to support it will cost a lot of money (or more money than was expected). Much like the conclusion in my previous article, in this case there may be a reason to continue performance tuning (including application development work to reduce resource utilization) to try and reduce the infrastructure cost impact. However, at some point a decision will have to be made as to whether the cost of the performance tuning effort exceeds the cost differential of the increased capacity needed to support the application. The sketch below shows the shape of that decision.
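A minimal illustration, with purely hypothetical figures:

    public class TuningTradeoff {
        public static void main(String[] args) {
            double extraCapacityCost   = 400_000; // hardware cost if we stop tuning now
            double projectedTuningCost = 150_000; // engineering cost of further tuning
            // Keep tuning only while the effort costs less than the capacity it saves:
            System.out.println(projectedTuningCost < extraCapacityCost);
        }
    }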
Here is another excerpt from our performance cookbook that will be published in the near future.
The past few weeks of meetings with various WebSphere Application Server customers reminded me of the importance of the basic, fundamental performance tuning tasks. The InfoCenter provides information on tuning at the OS level, TCP/IP, the JVM, etc. I have visited no fewer than 3 different environments running WebSphere Application Server without these base tunings. Just by applying the base tunings to the OS and JVM we saw as much as 99% less garbage collection, improved response time and throughput, and lower CPU utilization under the same production loads. The best part of following these instructions is that the administrator does not need to be a performance guru to realize these gains. These improvements also save money by requiring less capacity going forward.
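To give a flavour of what the OS-level base tunings look like on Linux (these are illustrative examples of the kinds of knobs the InfoCenter covers, not recommended values; always follow the InfoCenter's guidance for your platform):

    ulimit -n 65536                         # raise the per-process open file limit
    sysctl -w net.core.somaxconn=4096       # allow a deeper TCP listen backlog
    sysctl -w net.ipv4.tcp_fin_timeout=30   # release sockets stuck in FIN-WAIT-2 sooner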
In the "Tuning the JVM" section I have never been disappointed with the "Option 1" settings. Options 2 and 3 require the ability to place the application under load/stress test. If you do not have a load/stress test environment (i.e. you have to test in production) then stick with "Option 1".
Notice that "Tuning Performance" has several sections for both application developers and WebSphere administrators. This is because we all know that to realize the best performance gains one has to optimize the application code. Runtime tuning can yield 5-15%, but application code improvements can see 300% and higher performance improvements.
I am writing this blog entry to remind people that data collection should occur in all phases of an application's life cycle in production. This includes collecting data like javacores and heapdumps when the application is running in a nominal, steady-state condition. That baseline is valuable when problems do occur, because you can compare the data from when negative behaviour is occurring against the data from when things were fine. The data also feeds trend analysis, in conjunction with application monitoring tools, in terms of understanding what users are doing under both conditions.
Collect data often and be prepared for that day when all of a sudden the production environment is exhibiting distress.
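On the IBM SDK the dumps can be triggered programmatically as well as with kill -3; a minimal sketch (the com.ibm.jvm.Dump API is IBM-specific, so this will not compile on other JDKs):

    public class BaselineDumps {
        public static void main(String[] args) {
            com.ibm.jvm.Dump.JavaDump(); // writes a javacore: thread stacks, monitors, etc.
            com.ibm.jvm.Dump.HeapDump(); // writes a heapdump for memory analysis
        }
    }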
Troubleshooting a problem? If you are following the MustGathers, here is a link to a quick reference guide on how to set up trace in WebSphere Application Server.
In a recent review meeting on a problem with a high-volume application, many of the same questions that have been asked in the past were brought up. How does one prevent one problem from cascading into separate, unrelated facets of the application? On my old blog I spoke about circuit breakers in the specific case of a loop gone haywire (sketched below). There are other kinds of circuit breakers that can be placed in applications, which I have seen proven to work well.
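A minimal sketch of that loop circuit breaker; the 500 bound and the names here are hypothetical, standing in for whatever your loop's expected maximum is:

    import java.util.List;

    public class LoopBreaker {
        static final int MAX_ITERATIONS = 500; // the loop should never legitimately exceed this

        static void processAll(List<String> items) {
            int count = 0;
            for (String item : items) {
                if (++count > MAX_ITERATIONS) {
                    // trip the breaker instead of letting a haywire loop run away
                    throw new IllegalStateException("loop exceeded " + MAX_ITERATIONS + " iterations");
                }
                System.out.println(item); // stand-in for the real work
            }
        }
    }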
One that I tend to like, and haven't really blogged about much, allows the operations folks to disable specific functions of an application. This is easy to facilitate if the application is well designed (i.e. functions are easily identifiable by examining the HTTP request itself) or compartmentalized (i.e. separate functions are handled by separate logical clusters). For example, one cluster of servers handles only the "search" function, because we know search tends to exhaust resources, versus the "checkout" function, which we want running 100% of the time so that every user who wants to can purchase the goods in their shopping cart. The beauty of this setup is that if any specific function is experiencing a failure or causing an unexpected bottleneck, as detected through the application monitoring infrastructure, we can quickly trip the circuit breaker and shunt any following requests to a "Sorry, not available" page.
This type of circuit breaker is key for a couple of reasons. First and foremost, it addresses the fact that a failure of some sort is in progress: even though it hasn't been fixed, we can quickly move traffic to another path that at least gives the end user a response. This prevents additional requests from overwhelming the production environment and avoids having to restart all the servers to clear things up. The other reason is that it allows for more sharing of the infrastructure, because we have a plan in our runbook to quickly alleviate the problem by simply turning off the spigot.
I have seen two different approaches to implementing this. On the infrastructure side, if the functions of the application are easily identifiable or clustered independently, then the operations team can modify either the load balancing rules or the HTTP plugin configuration. I particularly like this one because as soon as the operations team has identified a particular fault they can trip the appropriate circuit breaker and get started with the problem determination steps.
Another approach, which can be used in combination with the previous one, is to build circuit breaker checks into the application at various points in the code. Each check reads a bit from the database to decide whether to continue processing the current function, similar to the loop circuit breaker I referenced above, where loops we know should never iterate more than 500 times abort and throw an exception on the 501st iteration. If there is consensus among the operations and development teams that some piece of functionality has broken, the bit can be flipped in the database, and that function is either disabled and directed to an error page or can alternatively return some cached value (if possible; it depends on the kind of data the user was after). A sketch of such a check follows.
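A minimal sketch of the database-backed check, assuming a hypothetical feature_flags table and names of my own invention:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import javax.sql.DataSource;

    public class FunctionBreaker {
        private final DataSource ds;

        public FunctionBreaker(DataSource ds) { this.ds = ds; }

        // Returns true when operations has flipped the bit to disable this function.
        public boolean isTripped(String function) {
            try (Connection c = ds.getConnection();
                 PreparedStatement ps = c.prepareStatement(
                         "SELECT disabled FROM feature_flags WHERE name = ?")) {
                ps.setString(1, function);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next() && rs.getBoolean(1);
                }
            } catch (SQLException e) {
                return false; // if the flag store is unreachable, keep serving (a judgment call)
            }
        }
    }

    // usage: if (breaker.isTripped("search")) { /* forward to the "Sorry, not available" page */ }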
After an application outage or an extremely negative performance event, one needs to conduct root cause analysis to determine the next corrective course of action. Having done this many times, let me document some of the steps taken in the initial phases of figuring out just what happened.
1. Inventory

The first task is to inventory what you have and how it is configured and deployed. This includes all software version information, configuration items for the application, pool sizes, etc.
Once that information is gathered, understand what may be missing and ask a lot of questions. Is the software at the latest version or fixpack level? If not, why not? Is there anything in the patches subsequent to the version in production that may address the problems encountered? Are there any odd configurations (e.g. a JDBC pool size 3x larger than the thread pool size, or 300-second timeouts)? Understand odd configurations and try to determine why they exist. Often this is difficult because the people who initially configured and deployed the environment have moved on to other projects and the team you're dealing with is simply in maintenance mode.
2. Discovery / Data Collection
In order to solve a problem we have to have data about the problem. No data, no resolution, because any decision is just a guess, and guesses do not work. My assumption here is that we are investigating Java-based applications.
a. Were thread dumps collected during the negative event? If not, why not? On Unix-based systems thread dumps are collected with 'kill -3 <pid>' (this doesn't "kill" the process; it just sends signal #3, which the JVM catches by dumping all the Java threads at that point in time). Collect thread dumps during all negative events in the future if they were not caught in the past. Thread dumps are a crucial piece of the puzzle to help narrow down what is going wrong.
b. Is verbose GC (garbage collection) enabled? If not, why not? Verbose GC (and the term is unfortunate, as it is not that verbose) is another crucial piece of data for understanding what memory utilization was like during the negative event; see the example after this list for enabling it on the IBM JVM.
c. If the application was written in house, then initiate a code review. Software is written by humans, and humans err. It could be a bug in the application that only kicks in during the appropriate planetary alignment. Reviewing code on a periodic basis is a good idea in general, even if you are not having any problems.
d. What backends are the applications accessing? Is there any information from the backend that would indicate it participated in the negative event (i.e. log files, DB2 snapshots, etc.)? It would not be the first time that some negative condition in the backend caused a front-end backlog. It could also be related to bugs in the application (see 2c above).
e. Are any application monitoring tools in place? Java is a robust environment that allows for rather detailed application monitoring of various factors like pool utilization, application response time, SQL response times, etc. Not having an application monitor in place simply limits the ability to understand what happened. Having one in place also allows alerts to be issued when a negative event is detected, so that proactive action can be taken by people who can troubleshoot the problem and hopefully fix it before the users ever notice.
f. Look in the application log files. There may be an indication of what is going on in the application logs. This really depends on how well the developers implemented logging in the application and may or may not be of any use. Fingers crossed!
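As promised in item 2b, enabling verbose GC on the IBM JVM is a generic JVM argument; this form follows the IBM documentation (the file name tokens expand to a timestamp and process id, and the trailing numbers request 5 rolling files of 10,000 GC cycles each - adjust for your environment):

    -Xverbosegclog:verbosegc.%Y%m%d.%H%M%S.%pid.txt,5,10000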
Get through this initial set of steps and then you can go on to the next phase, which is actually figuring out just what went wrong. I'll write about that in my next installment.