WebSphere Performance - Alexandre Polozoff's Point of View
Testing requires a tool and for one of my projects I'm using JMeter. I'm testing an https based site and was just having a hard time figuring out what was going on. I kept seeing an error in the JMeter "view results tree" that just said "ensure browser is set to accept the jmeter proxy certificate". I started researching that phrase and got nowhere quickly.
However, in the jmeter/bin subdirectory I found the jmeter.log file. In there I found a java.security.NoSuchAlgorithmException referencing the SunX509 KeyManagerFactory. Ah ha, yes, I'm running the IBM JRE and not the Sun version.
Unfortunately, setting proxy.cert.factory=IbmX509 in jmeter.properties (and of course uncommenting it) had no effect and I got the same SunX509 exception. I decided to try it at the command line as:
jmeterw -Dproxy.cert.factory=IbmX509
and voila, the problem went away!
Some tools collect data in cycle counts, but that isn't a very useful number when trying to understand how much time a function took.
For example, on a 2.993GHz machine (2993 cycles per microsecond), a function with 36040 cycles converts as 36040/2993 = 12.04143 usec, and dividing that by 1000 gives .01204 milliseconds.
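Here is a minimal sketch of that conversion. The function name and the choice to pass the clock speed in MHz are my own for illustration; it simply assumes the cycle count was taken at a constant clock rate.

```python
def cycles_to_time(cycles, clock_mhz):
    """Convert a CPU cycle count to (microseconds, milliseconds).

    A clock of N MHz executes N cycles per microsecond, so dividing
    the cycle count by the MHz figure yields microseconds.
    """
    usec = cycles / clock_mhz
    msec = usec / 1000.0
    return usec, msec


# The example from the text: 36040 cycles on a 2.993GHz (2993 MHz) machine.
usec, msec = cycles_to_time(36040, 2993)
print(f"{usec:.5f} usec = {msec:.5f} ms")  # prints "12.04143 usec = 0.01204 ms"
```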
From this link
With WebSphere Application Server v8 the IBM JRE provides a new garbage collection policy known as balanced. One should consider trying the balanced policy if running on the 64-bit JVM using a Java heap size over 4GB and still experiencing occasional long pauses with the gencon policy. It does impose a slight performance hit but based on how some applications are written or coded it may be necessary for the runtime operations team to try this policy in an attempt to avoid large pause times. The balanced policy can also take advantage of non-uniform memory access (NUMA) hardware architecture available on System x® and System p® using current versions of AIX®, Linux® or Windows®.
Your application is slow. You get a thread dump and look in the javacore and see lots of threads in ClassLoader.loadClass() with one thread holding the lock. You need to check your FFDC logs and look for "Too many open files." This means you haven't tuned the OS ulimit parameters and probably many others. Look in the InfoCenter for performance tuning and operating systems and pick the page for your OS. This should be the first link in the InfoCenter you access after you install WebSphere Application Server.
Edit: added link to the WAS v7 Infocenter page.
I'm working on a Liberty server (this is the latest beta I downloaded a couple of days ago) and using the installUtility I'm getting the following error.
# bin/installUtility install adminCenter-1.0
CWWKF1219E: The IBM WebSphere Liberty Repository cannot be reached. Verify that your computer has network access and firewalls are configured correctly, then try the action again. If the connection still fails, the repository server might be temporarily unavailable.
I then found out about a command to help figure out what is wrong:
bin]# ./installUtility find --type=addon --verbose=debug
[6/25/15 10:57:53:125 CDT] Failed to connect to the configured repository:
[6/25/15 10:57:53:128 CDT] com.ibm.ws.massive.RepositoryBackendIOException: Failed to read properties file https://public.dhe.ibm.com/ibmdl/export/pub/software/websphere/wasdev/downloads/assetservicelocation.props
Will update when I have more details on why I'm getting the ClassNotFoundException.
and that resolves the issue. A defect has been raised to have the script use the Java we supply instead of the machine's.
[Edited Aug 25 to add
I also needed to update /etc/host.conf to enable hosts file lookup and then add entries for
to /etc/hosts file
The Aug 2015 beta seems to have made a number of fixes to installUtility so if you're on an older beta get the latest.]
For as long as I can remember the most debated Java topic has been whether the heap size minimum should equal the maximum, with lots of urban myths and legends claiming that having them equal was better. In a conversation with a number of colleagues and Chris Bailey, who has led the Java platform for many years, he clarified the settings for the IBM JVM based on generational vs non-generational policy settings.
"The guidance [for generational garbage collection policy] is that you should fix the nursery size: -Xmns == -Xmnx, and allow the tenured heap to vary: -Xmos != -Xmox. For non generational you only have a tenured heap, so -Xms != -Xmx applies."
A link to Chris Bailey's presentation on generational garbage collection http://www.slideshare.net/cnbailey/tuning-ibms-generational-gc-14062096
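To make the quoted guidance concrete, here is a small sketch that builds the IBM JVM arguments for both cases. The function names and the sizes used in the example are hypothetical; only the option names (-Xmns, -Xmnx, -Xmos, -Xmox, -Xms, -Xmx) come from the guidance above, and the right sizes for any real application have to come from your own verbose GC data.

```python
def generational_heap_options(nursery_mb, tenured_min_mb, tenured_max_mb):
    """Fixed nursery (-Xmns == -Xmnx), variable tenured heap (-Xmos != -Xmox)."""
    if tenured_min_mb >= tenured_max_mb:
        raise ValueError("tenured minimum must be below the maximum so the heap can vary")
    return [
        f"-Xmns{nursery_mb}m",      # nursery minimum
        f"-Xmnx{nursery_mb}m",      # nursery maximum, same value: nursery is fixed
        f"-Xmos{tenured_min_mb}m",  # tenured minimum
        f"-Xmox{tenured_max_mb}m",  # tenured maximum, larger: tenured may grow
    ]


def non_generational_heap_options(heap_min_mb, heap_max_mb):
    """Only a tenured heap exists, so let it vary (-Xms != -Xmx)."""
    if heap_min_mb >= heap_max_mb:
        raise ValueError("heap minimum must be below the maximum so the heap can vary")
    return [f"-Xms{heap_min_mb}m", f"-Xmx{heap_max_mb}m"]


# Illustrative sizes only.
print(" ".join(generational_heap_options(256, 512, 1536)))
print(" ".join(non_generational_heap_options(512, 2048)))
```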
[edit to correct typo, added tags]
The past few weeks meeting with various WebSphere Application Server-based customers reminded me of the importance of the basic and fundamental performance tuning tasks. The InfoCenter provides information on tunings at the OS level, TCP/IP, JVM, etc. I have visited no fewer than 3 different environments running WebSphere Application Server without these base tunings. Just by applying the base tunings to the OS and JVM we saw as much as 99% less garbage collection, improved response time and throughput, and less CPU utilization with the same production loads. The best part of following these instructions is the administrator does not need to be a performance guru to realize these gains. These improvements also help save money by requiring less capacity going forward.
In the "Tuning the JVM" section I have never been disappointed with the "Option 1" settings. Options 2 and 3 require the ability to place the application under load/stress test. If you do not have a load/stress test environment (i.e. you have to test in production) then stick with "Option 1".
Notice that "Tuning Performance" has several sections for both application developers and WebSphere administrators. This is because we all know that to realize the best performance gains one has to optimize the application code. Runtime tuning can realize 5-15% but application code improvements can see 300% and higher performance improvements.
As I travel the world working performance problems I never see Microsoft Windows environments used outside the developer's desktop. Surprisingly these past couple of weeks I've been working in an environment where Microsoft Windows is used for the IBM HTTP Server tier with the WebSphere Application Server plug-in. Under normal operating conditions everything seems to work nominally.
However, much to my surprise, if we took down any of the application servers in the cluster of this very large cell I saw an anomaly. When the plug-in was attempting to route traffic to the downed application servers there seemed to be a really long lag on the connection refused processing. In fact, I was seeing at least a second to get through the TCP/IP roundtrip. This made no sense to me. One of my colleagues, Keys Botzum, took a Java application and ran it on both Windows and Solaris. The application simply tried to connect to localhost (to eliminate any DNS lookups or network latency from the test) on a port no one was listening on and looped around 20 times. On Windows the test took slightly over 20 seconds. On Solaris, less than a second (which was the behaviour I was expecting on Windows).
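The original test was a Java application that is not reproduced here. As a rough stand-in, this sketch does the same thing in Python: it connects 20 times to a local port with no listener and times the whole loop. The helper names are mine, and using 127.0.0.1 rather than the name localhost is an assumption made to keep name resolution out of the measurement.

```python
import socket
import time


def find_unused_port():
    """Ask the OS for a free port, then release it so nothing is listening on it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def time_refused_connects(port, attempts=20, host="127.0.0.1"):
    """Try to connect `attempts` times; return (failed_count, elapsed_seconds)."""
    failed = 0
    start = time.perf_counter()
    for _ in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=5):
                pass  # unexpected: something is listening on this port
        except OSError:
            failed += 1  # connection refused (or timed out)
    return failed, time.perf_counter() - start


if __name__ == "__main__":
    failed, elapsed = time_refused_connects(find_unused_port())
    print(f"{failed} failed connects in {elapsed:.2f} seconds")
```

Running the same script on each operating system in the tier gives a quick comparison of how long connection refused processing takes.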
If you are using, or planning to use, Microsoft Windows on the IHS tier be aware of this strange failure scenario on Windows. I'll try to investigate and see if there are any Windows settings to help tune this. Though the plan is to move off Windows to Red Hat Linux, which right now sounds like the right move to me.
Happy Thanksgiving to everyone. I hope everyone was able to get a good meal and time with family today.
This week I'm writing to you from Seoul, South Korea (it is actually Friday the day AFTER Thanksgiving here yet the Macy's Thanksgiving parade I am watching via Slingbox is still on). I'm working with some colleagues here and doing some mentoring and skills transfer to help broaden the problem determination skills within IBM. Which brings me to today's topic. We encountered a classic application hang. Sometimes, but not all the time, the administrator would restart the application on WAS v8.5 and when the test team started to apply load to the application it would hang. Javacores from kill -3 showed all threads stuck in createOrWaitForConnection. Now for those of you who do follow my blog you probably know about the various techniques I've posted to debug this situation. As we had no access to the developers it was up to us to try and figure out what was causing the hang. Various random twiddling of various AIX OS level parameters didn't work (random changes never do). If they waited long enough the application would sometimes recover and start processing again.
After watching the testing go on for a while I finally suggested we increase the connection pool maximum size to 2n+1 where n = thread pool maximum. The team had set the connection pool maximum equal to the thread pool maximum. There was some disbelief that we should go down this path. Any good administrator knows that we want classic funneling where thread pool max is larger than connection pool max to make optimal use of memory, CPU, etc. They re-ran the test and after the 5th attempt realized that we would not recreate the hang. I've posted this command before:
netstat -an |grep ESTA |grep <port#> |wc -l
which gives a connection count to the database on port#. It may be double the value (showing source and destination connections) so you may have to divide the value in half. In our case with thread pool max at 50 and connection pool max set to 101 we were capturing as many as 90 established connections to the database at any one time. Obviously the developers of the application were following the anti-pattern of opening a second connection to the database before closing the first connection which resulted in the deadlock our team in Seoul was observing.
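If you would rather count from a saved netstat capture than at the shell, the same filter can be sketched as below. The function name and the sample lines are made up for illustration. Unlike the plain grep, this version only matches the port at the end of an address field, accepting both the dotted form (as on AIX) and the colon form (as on Linux).

```python
def count_established(netstat_output, port):
    """Count ESTABLISHED lines whose local or remote address ends in `port`."""
    suffixes = (f".{port}", f":{port}")
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        if "ESTABLISHED" not in fields:
            continue
        if any(field.endswith(suffixes) for field in fields):
            count += 1
    return count


# Hypothetical capture: database listening on port 50000.
SAMPLE = """\
tcp4  0  0  10.0.0.5.41234  10.0.0.9.50000  ESTABLISHED
tcp   0  0  10.0.0.5:41236  10.0.0.9:50000  ESTABLISHED
tcp   0  0  10.0.0.5:41240  10.0.0.9:50000  TIME_WAIT
tcp   0  0  10.0.0.5:22     10.0.0.7:50123  ESTABLISHED
"""

print(count_established(SAMPLE, 50000))  # prints 2
```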
So why wasn't this deadlocking with each and every test? That comes down to randomness. While load tests may follow a set process and scripts, there is some variability between each test. It may not vary widely test after test, but the variability exists in terms of timing on the server. There can be various processes running, or not, at any given point in time. Load on the CPU or tasks the OS is doing can subtly change that timing, inducing variability. Timing is key and in some cases the test team got lucky and the test would work. Other times the timing was off and the application would deadlock. This particular anti-pattern is very sensitive to timing. Get the wrong timing and the application will deadlock, and hard.
In addition, when they would wait a while the application would recover. This is because under the covers WAS is quietly reclaiming connections, since it knows how long threads have been holding open connections. Once a threshold (timeout) is reached WAS begins the active process of reclaiming connections that have been open too long. This results in free connections being returned to the pool and the threads that were stuck in createOrWaitForConnection can resume processing.
What is the lesson learned here? When load testing an unknown application it might be worth setting connection pool max to 2n+1 of the thread pool max just to start with and using the command line netstat command (or your application monitoring tools) to see how many connections the application attempts to use. Then once experience is gained with the application reduce the size of the connection pool to something more reasonable based off the observed high water marks in the connection pool utilization. This is a much easier tactic than trying to debug an application that is deadlocked in createOrWaitForConnection.
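The arithmetic behind this can be sketched as follows. The first function is just the 2n+1 starting point from above. The second is not something I derived in this engagement; it is the standard resource allocation bound, under the assumption that every thread holds at most k connections at once and releases them when done. With that assumption a pool of n(k-1)+1 can never leave every thread waiting. Both function names are made up for illustration.

```python
def starting_pool_max(thread_pool_max):
    """Conservative starting size for an unknown application: 2n+1."""
    return 2 * thread_pool_max + 1


def min_deadlock_free_pool(thread_pool_max, conns_per_thread):
    """Smallest pool that cannot deadlock if each of n threads holds at most
    k connections at once: n*(k-1)+1.

    In the worst case every thread holds k-1 connections and waits for one
    more; a single spare connection lets one thread finish and release.
    """
    return thread_pool_max * (conns_per_thread - 1) + 1


n = 50  # thread pool max from the Seoul test
print(starting_pool_max(n))          # prints 101
print(min_deadlock_free_pool(n, 2))  # prints 51
print(min_deadlock_free_pool(n, 3))  # prints 101
```

Seen this way, 2n+1 is enough headroom for code that holds up to three connections per thread, which is why it makes a safe first setting before the observed high water mark tells you what the application really needs.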
After an application outage or an extremely negative performance event one needs to conduct root cause analysis to try and determine the next corrective course of action. Having done this many times let me document some of the steps done in the first/initial phases of trying to figure out just what happened.
The first task is to inventory what you have, how it is configured and deployed. This includes all software version information, configuration items for the application, pool sizes, etc.
Once that information is gathered, work out what may be missing by asking a lot of questions. Is the software at the latest version or fixpack level? If not, why not? Is there anything in the patches subsequent to the version in production that may address the problems encountered? Are there any odd configurations (i.e. JDBC pool size is 3x larger than the thread pool size; 300 second timeouts, etc)? Understand odd configurations and try to determine why they exist. Often this is difficult because the people that initially configured and deployed the environment have moved on to other projects and the team you're dealing with is simply in maintenance mode.
2. Discovery / Data Collection
In order to solve a problem we have to have data about the problem. No data, no resolution because any decision is just a guess. Guesses do not work. My assumption here is we are investigating Java based applications.
a. Were thread dumps collected during the negative event? If not, why not? Thread dumps are collected using 'kill -3 <pid>' (this doesn't "kill" the process it just sends signal #3 to the JVM which is caught by the JVM and it dumps all the Java threads at that point in time) on Unix based systems. Collect thread dumps during all negative events in the future if they were not caught in the past. Thread dumps are a crucial piece of the puzzle to help narrow down what is going wrong.
b. Is verbose GC (garbage collection) enabled? If not, why not? Verbose (and the term is unfortunate as it is not that verbose) GC is another crucial piece of data to understanding what the memory utilization was like during the negative event.
c. If the application was written in house then initiate a code review. Software is written by humans and humans err. It could be a bug in the application that only kicks in during the appropriate planetary alignment event. Reviewing code, on a periodic basis, is a good idea in general even if you are not having any problems.
d. What backends are the applications accessing? Is there any information from the backend that would indicate it participated in the negative event (i.e. log files, DB2 snapshots, etc)? It would not be the first time that some negative condition in the backend was causing a front end backlog. It could also be related to bugs in the application (see 2c above).
e. Are any application monitoring tools in place? Java is a robust environment that allows for rather detailed application monitoring of various factors like pool utilization, application response time, SQL response times, etc. Not having an application monitor in place simply limits the ability to understand what happened. Having an application monitor in place also allows for alerts to be issued when a negative event is detected. This allows for proactive actions to be taken by people who can troubleshoot the problem and hopefully fix it before the users ever notice.
f. Look in the application log files. There may be an indication of what is going on in the application logs. This really depends on how well the developers implemented logging in the application and may or may not be of any use. Fingers crossed!
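For item (a), a single thread dump is rarely enough; a short series shows whether threads are moving or stuck. Here is a minimal sketch that sends signal 3 (SIGQUIT) to a JVM several times. The function name, the count of 3, and the 30 second interval are my own assumptions, not a rule. It only works on Unix based systems, and it must only be pointed at a JVM, since other processes usually terminate on SIGQUIT.

```python
import os
import signal
import time


def request_thread_dumps(pid, count=3, interval_sec=30, send=os.kill, sleep=time.sleep):
    """Send SIGQUIT (the signal behind 'kill -3') to `pid` `count` times.

    `send` and `sleep` are injectable so the logic can be exercised
    without a real JVM. Returns the number of signals sent.
    """
    for i in range(count):
        send(pid, signal.SIGQUIT)  # the JVM catches this and writes a thread dump
        if i < count - 1:
            sleep(interval_sec)    # wait between dumps, but not after the last one
    return count


# Usage against a real JVM (the pid is hypothetical):
# request_thread_dumps(12345)
```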
Get through this initial set of steps and then you can go on to the next phase which is actually figuring out just what went wrong. Which I'll write about in my next installment.
The WebSphere Technical Conference in Berlin is well under way with the first two days completed. I'm presenting in the performance track with both a hands-on lab for performance troubleshooting and analysis and two sessions (repeated) on top 10 Performance Tuning settings for WebSphere and Performance Testing and Analysis. Despite airBerlin losing my checked bag for a couple of days the event has been pretty exhilarating. It is great to be able to meet with like minded techies who are dealing with the challenges of day to day production environments.
The photo below is from a pedestrian crossing of a canal not far from the hotel. The autumn colors have hit the trees and look really neat.
This is the page to follow if there seem to be any Maximo performance or stability problems.
Report scheduler enhancements in Maximo v7.5. As with any online transaction application most enterprises need to pull reports from their environment. Reports tend to be (a) scheduled to repeat and (b) heavy users of CPU and memory. Therefore having more control on the report scheduler is a good thing to look at in Maximo v7.5.
In a recent review meeting on a problem with a high volume application many of the same questions that have been asked in the past were brought up. How does one prevent one problem from cascading into separate, unrelated facets of the application? On my old blog I spoke about circuit breakers in the specific case of a loop gone haywire. There are other kinds of circuit breakers that can be placed in applications that I have seen and proven to work well.
One of the ones I tend to like and haven't really blogged about much allows the operations folks to disable specific functions of an application. This is easily facilitated if the application is well designed (i.e. functions are easily identifiable by examining the HTTP request itself) or is compartmentalized (i.e. separate functions are handled by separate logical clusters). For example, one cluster of servers only handles the "search" functions because we know that search will tend to exhaust resources, as opposed to the "checkout" function, which we want to run 100% of the time so that every user that wants to can purchase the goods in their shopping cart. The beauty of this setup is that if any specific function, as detected through the application monitoring infrastructure, is experiencing a failure or is causing an unexpected bottleneck, the operations team can quickly trip the circuit breaker and shunt any following requests to a "Sorry, not available" page.
The ability to trip this type of circuit breaker is key for a couple of reasons. First and foremost it addresses the fact that a failure of some sort is in progress and even though it hasn't been fixed we can quickly move traffic to another path that at least gives the end user a response. This avoids additional requests overwhelming the production environment and having to restart all the servers to clear things up. The other reason is that it also allows for more sharing of the infrastructure because we have a plan to follow in our runbook where we can quickly alleviate the problem by simply turning off the spigot.
I have seen two different approaches to solving this problem. In the case of the infrastructure if the functions of the application are easily identifiable or clustered independently then the operations team can easily modify either load balancing rules or make changes to the HTTP plugin configuration. I particularly like this one because as soon as the operations team has identified a particular fault they can trip the appropriate circuit breaker and get started with the problem determination steps.
Another approach, which can be used in combination with the previous solution, is to actually build circuit breaker checks into the application at various points in the code. Each check would cause a read from the database to check on a bit in the environment to see if it should continue processing the current function or not. This is similar to the loop circuit breaker I referenced above: if we know our loops should never iterate more than 500 times, we have them abort and throw an exception on the 501st iteration. If there is a consensus among the operations and development teams that some piece of functionality has broken, a bit can be flipped in the database and that function is either disabled and directs to an error page or can alternatively provide back some cached value (if possible; it depends on the kind of data the user was going after).
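Both kinds of in-application circuit breaker can be sketched in a few lines. This is only an illustration of the idea, not code from any application I have worked on: the class and function names are invented, and an in-memory set stands in for the bit that would really be read from the database.

```python
class CircuitOpenError(RuntimeError):
    """Raised when a loop runs past its known upper bound."""


class LoopBreaker:
    """Aborts a loop that iterates more often than it ever legitimately should."""

    def __init__(self, max_iterations=500):
        self.max_iterations = max_iterations
        self.count = 0

    def tick(self):
        self.count += 1
        if self.count > self.max_iterations:  # trips on iteration 501 by default
            raise CircuitOpenError(f"loop exceeded {self.max_iterations} iterations")


# Stand-in for the flag stored in the database (assumption for this sketch).
DISABLED_FUNCTIONS = set()


def handle(function_name, work, fallback="Sorry, not available"):
    """Run `work` unless operations has tripped the breaker for this function."""
    if function_name in DISABLED_FUNCTIONS:
        return fallback  # could also be a cached value where the data allows it
    return work()


print(handle("search", lambda: "search results"))  # prints "search results"
DISABLED_FUNCTIONS.add("search")                   # operations flips the bit
print(handle("search", lambda: "search results"))  # prints "Sorry, not available"
```

In a real application the flag lookup would need some caching so that the check itself does not add a database round trip to every request.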