WebSphere Peformance - Alexandre Polozoff's Point of View
polozoff 110000N2A2 Tags:  is accept the to set ensure browser jmeter proxy certificate 1 Comment 5,733 Views
Testing requires a tool and for one of my projects I'm using JMeter. I'm testing an https based site and was just having a hard time figuring out what was going on. I kept seeing an error in the JMeter "view results tree" that just said "ensure browser is set to accept the jmeter proxy certificate". I started researching that phrase and got nowhere quickly.
However, in the jmeter/bin subdirectory I found the jmeter.log file. In there I found a java.security.NoSuchAlgorithmException and referencing the SunX509 KeyManagerFactory. Ah ha, yes, I'm running the IBM JRE and not the Sun version.
Unfortunately changing the jmeter.properties proxy.cert.factory=IbmX509 (and of course uncommenting it) had no effect and I got the same SunX509 exception. I decided to try it at the command line as:
jmeterw -Dproxy.cert.factory=IbmX509 and voila the problem went away!
Some tools collect data in counts but that isn't a very useful number when trying to understand how much time
For example, on a 2.993GHz machine, a function with 36040 cycles would convert as (36040/2993)/1000, which is 12.04143 usec or .01204 milliseconds.
From this link
With WebSphere Application Server v8 the IBM JRE provides a new garbage collection policy known as balanced. One should consider trying the balanced policy if running on the 64-bit JVM using a Java heap size over 4GB and still experiencing occasional long pauses with the gencon policy. It does impose a slight performance hit but based on how some applications are written or coded it may be necessary for the runtime operations team to try this policy in an attempt to avoid large pause times. The balanced policy can also take advantage of non-uniform memory access (NUMA) hardware architecture available on System x® and System p® using current versions of AIX®, Linux® or Windows®.
The past few weeks meeting with various WebSphere Application Server-based customers reminded me of the importance of the basic and fundamental performance tuning tasks. The InfoCenter provides information on tunings at the OS level, TCP/IP, JVM, etc. I have visited no less than 3 different environments running WebSphere Application Server without these base tunings. Just by applying the base tunings to the OS and JVM we saw as much as 99% less garbage collection, improved response time, throughput and less CPU utilization with the same production loads. The best part of following these instructions is the administrator does not need to be a performance guru to realize these gains. These improvements also help save money requiring less capacity going forward.
In the "Tuning the JVM" section I have never been disappointed with the "Option 1" settings. Options 2 and 3 require the ability to place the application under load/stress test. If you do not have a load/stress test environment (i.e. you have to test in production) then stick with "Option 1".
Notice that "Tuning Performance" has several sections for both application developers and WebSphere administrators. This is because we all know that to realize the best performance gains one has to optimize the application code. Runtime tuning can realize 5-15% but application code improvements can see 300% and higher performance improvements.
For as long as I can remember the most debated Java topic has been the difference in opinion on the heap size minimum = maximum with lots of urban myths and legends that having them equal was better. In a conversation with a number of colleagues and Chris Bailey who has lead the Java platform for many years he clarified the settings for the IBM JVM based on generational vs non-generational policy settings.
"The guidance [for generational garbage collection policy] is that you should fix the nursery size: -Xmns == -Xmnx, and allow the tenured heap to vary: -Xmos != -Xmox. For non generational you only have a tenured heap, so -Xms != -Xmx applies.
A link to Chris Bailey's presentation on generational garbage collection http://www.slideshare.net/cnbailey/tuning-ibms-generational-gc-14062096
[edit to correct typo, added tags]
Your application is slow. You get a thread dump and look in the javacore and see lots of threads is ClassLoader.loadClass() with one thread holding the lock. You need to check your FFDC logs and look for "Too many open files." This means you haven't tuned the OS ulimit parameters and probably many others. Look in the InfoCenter for performance tuning and operating systems and pick the page for your OS. This should be the first link in the InfoCenter you access after you install WebSphere Application Server.
Edit: added link to the WAS v7 Infocenter page.
polozoff 110000N2A2 Tags:  tcp connection ip over fail refused windows poor performance 3,042 Views
As I travel the world working performance problems I never see Microsoft Windows environments used outside the developer's desktop. Surprisingly these past couple of weeks I've been working in an environment where Microsoft Windows is used for the IBM HTTP Server tier with the WebSphere Application Server plug-in. Under normal operating conditions everything seems to work nominally.
However, much to my surprise, if we took down any of the application servers in the cluster of this very large cell I saw an anomaly. When the plug-in was attempting to route traffic to the downed application servers there seemed to be a really long lag on the connection refused processing. In fact, I was seeing least a second to get through the TCP/IP roundtrip. This made no sense to me. One of my colleagues, Keys Botzum, took a Java application and ran it on both Windows and Solaris. The application simply tried to connect to localhost (to eliminate any DNS lookups or network latency from the test) on a port no one was listening to and looped around 20 times. On Windows the test took slightly over 20 seconds. On Solaris, less than a second (which was the behaviour I was expecting on Windows).
If you are, or planning to, use Microsoft Windows on the IHS tier be aware of this strange failure scenario on Windows. I'll try to investigate and see if there are any Windows settings to help tune this. Though the plan is to move off Windows to Redhat Linux which right now sounds like the right move to me.
Happy Thanksgiving to everyone. I hope everyone was able to get a good meal and time with family today.
This week I'm writing to you from Seoul, South Korea (it is actually Friday the day AFTER Thanksgiving here yet the Macy's Thanksgiving parade I am watching via Slingbox is still on). I'm working with some colleagues here and doing some mentoring and skills transfer to help broaden the problem determination skills within IBM. Which brings me to today's topic. We encountered a classic application hang. Sometimes, but not all the time, the administrator would restart the application on WAS v8.5 and when the test team started to apply load to the application it would hang. Javacores from kill -3 showed all threads stuck in createOrWaitForConnection. Now for those of you who do follow my blog you probably know about the various techniques I've posted to debug this situation. As we had no access to the developers it was up to us to try and figure out what was causing the hang. Various random twiddling of various AIX OS level parameters didn't work (random changes never do). If they waited long enough the application would sometimes recover and start processing again.
After watching the testing go on for a while I finally suggested we increase the connection pool maximum size to 2n+1 where n = thread pool maximum. The setting the team had set the connection pool maximum was equal to the thread pool max. There was some disbelief that we should go down this path. Any good administrator knows that we want classic funneling where thread pool max is larger than connection pool max to make optimal use of memory, CPU, etc. They re-ran the test and after the 5th attempt realized that we would not recreate the hang. I've posted this command before:
netstat -an |grep ESTA |grep <port#> |wc -l
which gives a connection count to the database on port#. It may be double the value (showing source and destination connections) so you may have to divide the value in half. In our case with thread pool max at 50 and connection pool max set to 101 we were capturing as many as 90 established connections to the database at any one time. Obviously the developers of the application were following the anti-pattern of opening a second connection to the database before closing the first connection which resulted in the deadlock our team in Seoul was observing.
So why wasn't this deadlocking with each and every test? That comes down to randomness. Load tests while they may follow a set process and scripts there is some variability between each test. While it may not vary widely test after test the variability exists in terms of timing on the server. There can be various processes running, or not, at any given point in time. Load on the CPU or tasks the OS is doing can subtly change that timing inducing variability. Timing is key and in some cases the test team got lucky and the test would work. Other times the timing was off and the application would deadlock. This particular anti-pattern is very sensitive to timing. Get the wrong timing and the application will deadlock and hard.
In addition, when they would wait a while the application would recover. This is because underneath the cover of WAS it is quietly reclaiming connections because it knows how long threads have been holding open connections. Once a threshold (timeout) is reached WAS begins the active process of reclaiming connections that have been opened too long. This results in free connections being returned to the pool and the threads that were stuck in createOrWaitForConnection can resume processing.
What is the lesson learned here? When load testing an unknown application it might be worth setting connection pool max to 2n+1 of the thread pool max just to start with and using the command line netstat command (or your application monitoring tools) to see how many connections the application attempts to use. Then once experience is gained with the application reduce the size of the connection pool to something more reasonable based off the observed high water marks in the the connection pool utilization. This is a lot easier tactic than trying to debug an application that is deadlocked in createOrWaitForConnection.
After an application outage or an extremely negative performance event one needs to conduct root cause analysis to try and determine the next corrective course of action. Having done this many times let me document some of the steps done in the first/initial phases of trying to figure out just what happened.
The first task is to inventory what you have, how it is configured and deployed. This includes all software version information, configuration items for the application, pool sizes, etc.
Once that information is gathered understand what may be missing and asking a lot of questions. Is the software at the latest version or fixpack level. If not, why not? Is there anything in the patches subsequent to the version in production that may address the problems encountered? Are there any odd configurations (i.e. JDBC pool size is 3x larger than the thread pool size; 300 second timeouts, etc)? Understand odd configurations and try to determine why they exist. Often this is difficult because the people that initially configured and deployed the environment have moved on to other projects and the team you're dealing with is simply in maintenance mode.
2. Discovery / Data Collection
In order to solve a problem we have to have data about the problem. No data, no resolution because any decision is just a guess. Guesses do not work. My assumption here is we are investigating Java based applications.
a. Were thread dumps collected during the negative event? If not, why not? Thread dumps are collected using 'kill -3 <pid>' (this doesn't "kill" the process it just sends signal #3 to the JVM which is caught by the JVM and it dumps all the Java threads at that point in time) on Unix based systems. Collect thread dumps during all negative events in the future if they were not caught in the past. Thread dumps are a crucial piece of the puzzle to help narrow down what is going wrong.
b. Is verbose GC (garbage collection) enabled? If not, why not? Verbose (and the term is unfortunate as it is not that verbose) GC is another crucial piece of data to understanding what the memory utilization was like during the negative event.
c. If the application was written in house then initiate a code review. Software is written by humans and humans err. It could be a bug in the application that only kicks in during the appropriate planetary alignment event. Reviewing code, on a periodic basis, is a good idea in general even if you are not having any problems.
d. What backends are the applications accessing? Is there any information from the backend that would indicate participating in the negative event (i.e. log files, DB2 snapshots, etc)? It would not be the first time that some negative condition in the backend was causing a front end backlog. It could also be related to bugs in the application (see 2c above).
e. Are any application monitoring tools in place? Java is a robust environment that allows for rather detailed application monitoring of various factors like pool utilization, application response time, SQL response times, etc. Not having an application monitor in place simply limits the ability to understand what happened. Having an application monitor in place also allows for alerts to be issued when a negative event is detected. This allows for proactive actions to be taken by people who can troubleshoot the problem and hopefully fix it before the users ever notice.
f. Look in the application log files. There may be a indication of what is going on in the application logs. This really depends on how well the developers implemented logging in the application and may or may not be of any use. Fingers crossed!
Get through this initial set of steps and then you can go on to the next phase which is actually figuring out just what went wrong. Which I'll write about in my next installment.