WebSphere Performance - Alexandre Polozoff's Point of View
With WebSphere Application Server v8, the IBM JRE provides a new garbage collection policy known as balanced. Consider trying the balanced policy if you are running the 64-bit JVM with a Java heap over 4GB and still experiencing occasional long pauses with the gencon policy. It does impose a slight throughput cost, but depending on how an application is written the runtime operations team may need to try this policy to avoid large pause times. The balanced policy can also take advantage of non-uniform memory access (NUMA) hardware available on System x® and System p® running current versions of AIX®, Linux® or Windows®.
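As a quick sketch, switching policies is just a generic JVM argument; the heap sizes and class name below are illustrative, not a recommendation:
# balanced policy on a 64-bit IBM JVM with a heap over 4GB
java -Xgcpolicy:balanced -Xms6g -Xmx6g MyApp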
Another excerpt from our WebSphere Application Server Performance Cookbook (due for external publication sometime in the near future), this time on determining the health of a JVM. This may or may not look like the final publication.
"A common question is how does one determine how efficiently is the JVM performing and what metrics point to a JVM that is in, or heading toward, distress?
Once you have determined that the application is not healthy follow the appropriate MustGather and open a PMR with IBM Support."
Identifying and resolving problems in the cloud with IBM SmartCloud Monitoring
IBM's new application monitoring solution
During the Q&A session of a customer performance presentation, one of the questions was about performance when migrating from a bare metal environment to a virtualized cloud environment. This is a really good question! As I've said many times, when it comes to performance you cannot manage what you cannot measure. The cloud doesn't escape the rules of Computer Science and Performance 101.
This is where IBM's SmartCloud Monitoring - Application Insight can help. The features page describes how this new application monitoring solution provides insight into application performance and user experience: dynamic monitoring of cloud applications across a variety of public and private cloud providers, including Amazon EC2, VMware and IBM's PureApplication System (IPAS), plus diagnostic drill downs and embedded monitoring technology to aid troubleshooting and root cause analysis. The resource page provides articles on topics such as service management and proactive application management. The community page offers a wiki on best practices, a forum for technical questions, a blog for the latest news and updates, and a way to engage with IBM experts or peers at other enterprise organizations. Finally, documentation and an extensive knowledge base can be found on the SmartCloud support page.
Why is this important? Primarily because clouds are comprised of many VM instances. If the underlying resources the VMs need (CPU, RAM, network, disk, etc.) are not properly configured and tracked, they can easily be overcommitted, resulting in perplexing performance problems. Inside the VM everything may look nominal, but without visibility into the cloud itself troubleshooting is near impossible, prolonging the poor performance and end user experience. No one wants to prolong a bad end user experience.
In addition, production monitoring data can be fed back into the cloud capacity planning organization, which can use it in their calculation models to maintain Service Level Agreements (SLAs) and availability requirements. Cloud infrastructure, while seemingly boundless, can run short of resources as applications grow and mature if no one is monitoring the environment.
From an administration and infrastructure perspective, cloud technologies are introducing new and exciting capabilities to further simplify those tasks. Last year, when I was working on the IPAS performance team, I was really enthused by patterns and the powerful capabilities behind them, especially where consistency and repeatability are a must. Even more impressive is the symbiotic integration of various IBM technologies, like Worklight in the mobile space developing its IPAS support and mobile application platform patterns.
[Edited to correct a typo and add tags, a title, and a couple of links to more specific content]
On Solaris, this tuning page has the details I often follow for Option 1.
A lot of applications use JMS for asynchronous messaging. And as many of you know, I am really into scalability, especially when working with high volume applications. There is a two-part article from one of my UK colleagues on a flexible and scalable WebSphere MQ topology pattern, Part 1 and Part 2, which contains code samples.
I am often posed with the philosophical question
"Is a complete production outage a performance problem?"
From a performance perspective it sure is!
Today's blog post is about a performance aspect most folks in IT miss completely: backup and restore.
"Wait a minute here Mr. Polozoff! We take regular backups!"
"Ah, but have you ever actually taken that backup and restored it to another server?" is my typical response. And you can probably guess the answer I get and it is rarely positive. That is the point of today's post. I was reminded of this from a recent discussion with a colleague that was conducting a review of a client environment. Backup is a regular activity in just about every IT shop but what is not a regular activity is actually seeing if the backup can be restored to another machine. I actually saw the devastating effect this had at one of the clients I was supposed to engage with. On my second day I arrived at the office to be met by the manager I was working with. He had bad news. The production database failed and the backup failed to restore. In fact, none of the backups they had could restore the database. Since I was there to look at some problems in production that effectively ended that engagement because I couldn't study the application server environment if it had no database to connect to.
That is a hard way to learn an important lesson. No matter how often you take a backup, it needs to be regularly tested and restored to another server to ensure the backup is what we think it is and not just a jumble of bits that ultimately has zero value.
Back in 2009 I wrote a blog post on socketRead issues causing hung threads and on using timeouts when communicating with a database. These problems don't happen very often anymore, as networks and databases tend to run on fairly robust environments. You can imagine my surprise when, late last week, I received an email from a colleague working with an application suffering from the same symptoms described in that post.
The reason for setting the timeouts is to fail fast rather than have the application appear to be non-responsive.
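If the backend happens to be DB2 with the JCC driver, one hedged example is the driver's blocking read timeout, set as a generic JVM argument; the 300-second value is purely illustrative:
# fail socket reads that block longer than 300 seconds
-Ddb2.jcc.blockingReadConnectionTimeout=300
Other drivers and the WebSphere datasource have their own timeout properties; the point is that something must bound the socket read.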
However, what do you do if the timeout doesn't seem to be kicking in? In this case, the first thing to do is open a PMR with IBM Support.
Data really needs to be collected in at least two places: on the application server and on the database.
For the database http://www.ibm.com/developerworks/data/library/techarticle/dm-0812wang/ provides information on how to use db2top to collect data when the problem is occurring.
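If you just want to try it quickly, a minimal db2top invocation looks like this (the database name is a placeholder):
db2top -d <dbname>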
Then for the application server (this happens to be for BPM), http://www-01.ibm.com/support/docview.wss?uid=swg21611603 lists various sets of mustgathers to collect so IBM Support can run its analysis.
Networks can also have hiccups. By running tcpdump on both the application server and database sides, one can use protocol inspection tools like Wireshark to look at the underlying network communications and see what, if any, problems may be occurring there.
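A sketch of the capture on the application server side; the interface name, capture file and the <dbhost>/<port#> placeholders are all illustrative:
# capture full packets (-s 0) to a file, filtered to the database traffic
tcpdump -i eth0 -s 0 -w appserver_side.pcap host <dbhost> and port <port#>
Run the equivalent capture on the database side, then compare the two .pcap files in Wireshark, looking for retransmissions or long gaps.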
Your application is slow. You take a thread dump, look in the javacore, and see lots of threads in ClassLoader.loadClass() with one thread holding the lock. Check your FFDC logs and look for "Too many open files." This means you haven't tuned the OS ulimit parameters, and probably many others. Look in the InfoCenter for performance tuning and operating systems and pick the page for your OS. This should be the first link in the InfoCenter you access after you install WebSphere Application Server.
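A quick sketch of checking and raising the open file limit on Linux; the user name and values are illustrative, so take the real numbers from the tuning page for your OS:
# show the open-file limit for the current shell
ulimit -n
# example /etc/security/limits.conf entries for the user running WAS
wasadmin soft nofile 8192
wasadmin hard nofile 8192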
Edit: added a link to the WAS v7 InfoCenter page.
IBM provides a tool to help with analysis of client side performance. One of the benefits of doing this is to help identify static objects that are not being cached at the browser. In addition, for JSF based applications, this tool helps identify large client side caches.
Happy Thanksgiving to everyone. I hope everyone was able to get a good meal and time with family today.
This week I'm writing to you from Seoul, South Korea (it is actually Friday, the day AFTER Thanksgiving, here, yet the Macy's Thanksgiving parade I am watching via Slingbox is still on). I'm working with some colleagues here, doing mentoring and skills transfer to help broaden the problem determination skills within IBM. Which brings me to today's topic. We encountered a classic application hang. Sometimes, but not every time, when the administrator restarted the application on WAS v8.5 and the test team started to apply load, the application would hang. Javacores from kill -3 showed all threads stuck in createOrWaitForConnection. Those of you who follow my blog probably know the various techniques I've posted to debug this situation. As we had no access to the developers, it was up to us to figure out what was causing the hang. Random twiddling of AIX OS-level parameters didn't work (random changes never do). If they waited long enough, the application would sometimes recover and start processing again.
After watching the testing go on for a while, I finally suggested we increase the connection pool maximum size to 2n+1, where n = thread pool maximum. The team had set the connection pool maximum equal to the thread pool maximum. There was some disbelief that we should go down this path. Any good administrator knows we want classic funneling, where the thread pool maximum is larger than the connection pool maximum, to make optimal use of memory, CPU, etc. They re-ran the test, and after the 5th attempt realized that we could not recreate the hang. I've posted this command before:
netstat -an |grep ESTA |grep <port#> |wc -l
which gives a count of connections to the database on port#. It may be double the actual value (showing both source and destination sides), so you may have to divide it in half. In our case, with the thread pool maximum at 50 and the connection pool maximum set to 101, we were capturing as many as 90 established connections to the database at any one time. Obviously the developers of the application were following the anti-pattern of opening a second connection to the database before closing the first, which resulted in the deadlock our team in Seoul was observing.
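To watch how that count moves over the course of a test, a simple sampling loop works (the 5 second interval is arbitrary):
# timestamped count of established connections on <port#>, every 5 seconds
while true; do
  echo "$(date) $(netstat -an | grep ESTA | grep <port#> | wc -l)"
  sleep 5
done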
So why wasn't this deadlocking with each and every test? That comes down to randomness. While load tests may follow a set process and scripts, there is some variability between each run. It may not vary widely test after test, but the variability exists in the timing on the server. There can be various processes running, or not, at any given point in time. Load on the CPU, or tasks the OS is doing, can subtly change that timing, inducing variability. Timing is key: in some cases the test team got lucky and the test would work; other times the timing was off and the application would deadlock. This particular anti-pattern is very sensitive to timing. Get the wrong timing and the application will deadlock, and hard.
In addition, when they waited a while, the application would recover. This is because under the covers WAS quietly reclaims connections, since it knows how long threads have been holding them open. Once a threshold (timeout) is reached, WAS begins actively reclaiming connections that have been open too long. This returns free connections to the pool, and the threads that were stuck in createOrWaitForConnection can resume processing.
What is the lesson learned here? When load testing an unknown application, it might be worth initially setting the connection pool maximum to 2n+1 of the thread pool maximum and using the netstat command line (or your application monitoring tools) to see how many connections the application attempts to use. Then, once experience is gained with the application, reduce the size of the connection pool to something more reasonable based on the observed high water marks in the connection pool utilization. This is a much easier tactic than trying to debug an application that is deadlocked in createOrWaitForConnection.
Some tools collect data in CPU cycle counts, but that isn't a very useful number when trying to understand how much time was actually spent. To convert cycles to time, divide by the machine's clock speed.
For example, on a 2.993GHz machine (2993 cycles per microsecond), a function with 36040 cycles converts to 36040/2993 ≈ 12.04143 microseconds, or (36040/2993)/1000 ≈ .01204 milliseconds.
From this link
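A one-liner to reproduce that arithmetic (bc truncates to the given scale):
# cycles divided by cycles-per-microsecond gives microseconds
echo "scale=5; 36040/2993" | bc
# prints 12.04143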
Fix pack 1 for WebSphere Application Server v8.5.5 was released earlier today.
A few days ago a few colleagues contacted me about my article on proactive application monitoring. They're building some templates for monitoring applications in the cloud, and they had some questions specifically around thresholds for many of the metrics I had listed. For example, one of the questions was about datasource connection pool utilization: is it reasonable to set thresholds for warnings if the connection pool is 85% utilized and critical alerts if it is 95% utilized? Likewise, similar questions around CPU utilization: would a warning at 75% and critical alerts at 90% be reasonable?
The answer is, (drum roll please) it depends.
No two applications are alike. There are low volume, rarely used applications that may never get above 2% connection pool utilization. Conversely, there are high volume applications where the connection pool can be running at 90-100% utilization. Better metrics to watch (via the PMI metrics) are (a) how many threads had to wait for a connection from the connection pool and (b) how long those threads had to wait. Both of those metrics directly impact the throughput and response time of the application.
Same with CPU utilization. Some organizations like to run their servers hot, over 90% utilization, because they have spare, passive capacity they can bring online. Others like to run at less than 50% utilization because they want spare capacity in an active-active modus operandi.
Setting useful thresholds depends on understanding the organization's Service Level Agreements (SLAs) and the application's Non Functional Requirements (NFRs).
It is being reported in the news that there is another OpenSSL bug around the SSL handshake, where the handshake can be forced to use a less secure encryption method. This is interesting because a couple of months back I was troubleshooting a problem in two supposedly identical prod/test environments: test was negotiating TLSv1.2 but the prod environment was negotiating SSLv3. We eventually solved that problem by changing the configuration of the prod environment to only use TLSv1.2. But I am still curious how this showed up in the prod environment. May be time to circle back and take a closer look.
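A quick way to check what a server will negotiate is openssl s_client; the hostname is a placeholder, and depending on how your OpenSSL was built the -ssl3 option may not be available:
# does the server still accept SSLv3? (a handshake failure here is good news)
openssl s_client -connect prodhost:443 -ssl3
# and does TLSv1.2 negotiate cleanly?
openssl s_client -connect prodhost:443 -tls1_2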
As applications grow over time, they tend to add features and functions, and then one day they run out of native heap as more and more Java classes are piled in.
Edit: Oct 30, 2015 and Nov 17, 2015
The Native OOM (NOOM) landscape continues to shift. An argument that offsets the heap to a different area of the address space, -Xgc:preferredHeapBase, is a much better option. With this argument, one can place the Java heap above the first 4GB of address space, allowing native code to use almost all of the lower 4GB.
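A sketch of the setting; 0x100000000 is the 4GB boundary, and the heap size and class name are illustrative:
# keep compressed references but base the Java heap at the 4GB mark
java -Xcompressedrefs -Xgc:preferredHeapBase=0x100000000 -Xmx3g MyApp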
Testing requires a tool, and for one of my projects I'm using JMeter. I'm testing an https based site and was having a hard time figuring out what was going on. I kept seeing an error in the JMeter "view results tree" that just said "ensure browser is set to accept the jmeter proxy certificate". I started researching that phrase and got nowhere quickly.
However, in the jmeter/bin subdirectory I found the jmeter.log file. In there I found a java.security.NoSuchAlgorithmException referencing the SunX509 KeyManagerFactory. Ah ha, yes, I'm running the IBM JRE and not the Sun version.
Unfortunately, setting proxy.cert.factory=IbmX509 in jmeter.properties (and of course uncommenting it) had no effect, and I got the same SunX509 exception. I decided to try it at the command line as:
jmeterw -Dproxy.cert.factory=IbmX509
and voila, the problem went away!
For as long as I can remember, the most debated Java topic has been whether to set the heap size minimum equal to the maximum, with lots of urban myths and legends claiming that having them equal was better. In a conversation with a number of colleagues and Chris Bailey, who has led the Java platform for many years, he clarified the settings for the IBM JVM based on generational vs non-generational policy settings.
"The guidance [for generational garbage collection policy] is that you should fix the nursery size: -Xmns == -Xmnx, and allow the tenured heap to vary: -Xmos != -Xmox. For non generational you only have a tenured heap, so -Xms != -Xmx applies.
A link to Chris Bailey's presentation on generational garbage collection: http://www.slideshare.net/cnbailey/tuning-ibms-generational-gc-14062096
[Edited to correct a typo and add tags]
I'm working on a Liberty server (the latest beta, which I downloaded a couple of days ago), and when using installUtility I'm getting the following error.
# bin/installUtility install adminCenter-1.0
CWWKF1219E: The IBM WebSphere Liberty Repository cannot be reached. Verify that your computer has network access and firewalls are configured correctly, then try the action again. If the connection still fails, the repository server might be temporarily unavailable.
I then found a command to help figure out what is wrong:
bin]# ./installUtility find --type=addon --verbose=debug
[6/25/15 10:57:53:125 CDT] Failed to connect to the configured repository:
[6/25/15 10:57:53:128 CDT] com.ibm.ws.massive.RepositoryBackendIOException: Failed to read properties file https://public.dhe.ibm.com/ibmdl/export/pub/software/websphere/wasdev/downloads/assetservicelocation.props
Will update when I have more details on why I'm getting the ClassNotFoundException.
Update: using the Java we supply instead of the machine's resolves the issue. A defect has been raised to have the script use the supplied Java by default.
[Edited Aug 25 to add
I also needed to update /etc/host.conf to enable hosts file lookup and then add entries for the repository hosts to the /etc/hosts file.
The Aug 2015 beta seems to have made a number of fixes to installUtility so if you're on an older beta get the latest.]
Someone takes a javacore during what looks to be a hung app server and notices it contains lots of threads in socketRead. This is symptomatic of a slow back end whether it is a database, Web service, etc. An application is as strong as its weakest link. If the backend the application depends on is unable to respond in a timely manner then there is nothing that can be tuned at the application layer except for aggressive timeouts to protect the application from getting stuck. Hangs like these typically happen under high load/traffic conditions. It is important that the group that maintains the backend is aware of an issue with their tier and they need to fix it.
This is the page to follow if there seem to be any Maximo performance or stability problems.
Report scheduler enhancements in Maximo v7.5. As with any online transaction application, most enterprises need to pull reports from their environment. Reports tend to be (a) scheduled to repeat and (b) heavy users of CPU and memory. Therefore, having more control over the report scheduler is a good thing to look at in Maximo v7.5.
I am often asked specifically which metrics should be monitored in WebSphere Application Server. This list can be considered a starting point. The WAS admin will frequently use the "Custom" PMI setting and disable any metrics not in this list.
Thread Pools.
Suggested: Pool Size, Active Thread Count.
Optional: ActiveTime (avg. time in use), ConcurrentHungThreadCount.
For WESB, per-mediation: Mediation.ThreadCount.
Database - JDBC Data Source Connection Pools.
Messaging - JMS Queue Connection Factory Connection Pools.
Suggested (for both pool types): PoolSize, WaitingThreadCount, WaitTime.
Messaging engine buffers - Suggested: BufferedReadBytesCount, BufferedWriteBytesCount.
Messaging queues - Suggested: AvailableMessageCount, LocalMessageWaitTime.
Transactions (per application server).
Memory Management (Java GC).
Suggested: Heap Size, % Free after GC, % Time spent in GC.
Yesterday at my "Performance Testing and Analysis" talk, a question was asked: is it better to have one large machine and virtualize the environment, or to have lots of small machines?
Both strategies work. If you decide to go with large capacity frames, then you want at least 3 frames. This way, if one frame is taken out of service, there are at least two other frames running. Otherwise, if one builds out only 2 large frames and one is taken out of service, the remaining frame becomes a single point of failure. I don't like SPOFs, so I would have at least 3. This assumes one frame can carry the entire production load during the outage; that information has to be culled from the performance testing. If it cannot, then more frames will likely be needed to ensure that even if half the infrastructure is taken out of service, the remaining frames can carry the entire production workload.
Likewise, having lots of smaller machines also works. The odds of a massive hardware outage are lower with smaller machines, so as one fails it can be safely taken out of service and replaced without impacting the production workload.
Which strategy is better? In my opinion they are both valid, as long as proper capacity planning is conducted to ensure that when an outage occurs (and think worst case scenario here) the remaining infrastructure is able to continue processing the production workload without impacting the SLA (Service Level Agreement, including the non-functional requirements, e.g. response time, resource utilization, etc.). You do have a defined SLA, right?
A few years ago I blogged about how adding JSESSIONID logging to the access log helps identify which cluster member a user was pinned to. It turns out this also helps troubleshoot another interesting problem.
A WebSphere Application Server administrator noted that one of the JVMs in their cluster had a far higher session count than any other JVM in the cluster, roughly a 3:1 imbalance in the total number of sessions. We applied the JSESSIONID logging and captured all the session ids. Through various Unix utilities (cut, sort, uniq, etc.) we ended up with a prime suspect: one session was calling the /login page 10-20 times per second and had eclipsed every other session by over 10x the number of requests.
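For the curious, a sketch of that kind of pipeline; it assumes the JSESSIONID is the last field on each access log line, which will vary with your log format:
# count requests per session id, busiest sessions first
awk '{print $NF}' access.log | sort | uniq -c | sort -rn | head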
Why did we go down this path? We were able to see through the PMI data that the session manager in WebSphere Application Server was invalidating sessions, so we knew it wasn't a product issue of sessions not being deleted. Also, one JVM in the cluster creating more sessions than the other JVMs is suspicious; I would have expected higher load across the whole cluster. In addition, they had seen the behaviour move around the cluster every few months. That led me to believe this was like a replay attack: someone at some point captured a response with a JSESSIONID and was then using that JSESSIONID over and over again until some event caused it to capture a new JSESSIONID (most likely a failover event as the cluster went through a rolling restart). That behaviour was curious! The fact it was smart enough to realize the HTTP header content had changed, and adapted, was interesting.
So the next time you see one or more JVMs with considerably higher session counts than the other JVMs in the same cluster, you can use the same troubleshooting methodology to track down the suspect, especially if your application is Internet-facing, meaning anyone can start pinging it.