For some reason I had problems downloading the Liberty Profile 9 beta and kept getting 404 Not Found responses. This link, sent to me by a colleague, worked. I'm not sure why, but if you're having trouble downloading the beta, follow the link.
WebSphere Performance - Alexandre Polozoff's Point of View
polozoff 110000N2A2 1,126 Views
Fix pack 1 for WebSphere Application Server v8.5.5 was released earlier today.
In over three decades of IT, one of the consistent themes of my job has been transforming underperforming business units for various clients. Some of the scenarios I've been called in to transform have been
Those are just three examples of underperforming scenarios, each of which affects the end user's experience. If those users are not locked into the brand they could potentially leave, go to a competitor and never come back. The underlying theme in the scenarios needs to be addressed by someone with an extensive operational and application development background in information technology. Here are some of the strategies I have used to transform underperformers into overachievers.
One of the first things I learned in my career was taking charge. This is not as easy as it sounds. First, it meant having extreme confidence in myself and my decisions. In the early days sometimes I was successful but other times I was not and needed to step back and reconsider what the next steps had to be. What follows are some of my lessons learned with taking charge.
Underperforming business units tend to have difficulty making decisions. Quite simply, there are too many ways to do any single technical task or effort that solves a problem. When an organization runs by consensus it is even harder to make a decision. I learned early in my career that giving any organization a choice, even one as simple as effortA versus effortB, was futile. I have spent countless hours in meetings over the risks and ramifications of two or more choices. I found that providing one solution, and only one solution, is the most expeditious way to move an organization forward, all the while keeping plan B in the back of my mind in case my first decision hits a roadblock. At this point I have had enough experiences and previous failures to pretty much nail what plan A needs to be and how to execute it.
Have a plan and articulate it
A plan means nothing if the plan makes no sense to anyone else. When I provide the direction to move an organization forward it comes with a step-by-step plan that addresses
Even the short-term, immediate tactical steps may have several iterations of different efforts that can span weeks or months, depending on how severe the problem is. Inevitably, regardless of its scope, the plan requires actions in both operations and application development. Sometimes I got lucky and it was only one or the other, but not often.
Start with the basics. Have things like the recommended OS level tunings been applied? If not then that is the first part of the plan.
It should go without saying that any plan should be testable outside of the production environment. However, as applications mature and the user base grows so does the operational IT environment. Test as much as possible and keep production changes down to one change per change window with a tested back out plan. Which brings us to the next topic.
Repeatability (AKA scripting)
To minimize risk, a robust operational IT infrastructure requires the ability to perform tasks over and over again while understanding exactly what the resulting output should be. Whether it is configuring a configuration item or deploying an application, we should clearly understand the end result. Scripts that are developed in test and proven to work there can be promoted to production. I'll note here that with the advent of DevOps this facet of IT operations has become significantly easier and more robust than in the past. In some of the very large scale performance testing I manage internally, with 10,000 Liberty servers in a SoftLayer environment, I know that our Gradle scripts will build out the environment from scratch the same way each and every time.
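As a minimal sketch of what "repeatable" means in practice, the following Python function applies a set of desired settings and reports what it changed. Running it a second time changes nothing, which is the property a promotable script should have. The setting names and values are hypothetical and are not taken from any real WebSphere configuration.

```python
def ensure_settings(current: dict, desired: dict) -> list:
    """Apply desired settings to current in place; return the keys that changed."""
    changed = []
    for key, value in sorted(desired.items()):
        if current.get(key) != value:
            current[key] = value
            changed.append(key)
    return changed


# Hypothetical settings, for illustration only.
server_config = {"maxThreads": 50}
desired_config = {"maxThreads": 100, "keepAlive": True}

first_run = ensure_settings(server_config, desired_config)   # makes two changes
second_run = ensure_settings(server_config, desired_config)  # makes no changes
```

Because the second run is a no-op, the same script can be executed in test and then in production with a predictable end state.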
Change is a scary word for a lot of people because it also means risk to the business. This is why change processes should be followed meticulously. Having redundancy and lots of it also reduces risk during a change. If redundancy doesn't exist then it needs to be the first part of the transformation plan.
Operational, infrastructural changes versus application fixes
I have always separated operational changes from application fixes rather than combining them in the same change window. How quickly application fixes are introduced ultimately depends on how fast they can be identified, coded and tested. Sometimes it is an easy code fix, but other times whole architectures or designs need to be reworked due to poor decision making. The same level of complexity can exist in the IT infrastructure, slowing down the speed with which change can be made because developing scripts or testing can take a lot of time. And testing the back out plan can take longer than testing the solution.
Move the organization to be proactive
Underperforming organizations typically react to problems. That means that the problem(s) may have been impacting the end user's experience for some time. Application monitoring is key to helping an organization act proactively on problems. I once led a business unit that went from being penalized every quarter for server uptime not meeting SLAs to collecting a bonus every quarter for exceeding the SLAs, all by installing and configuring the right tools to allow the organization to identify problems and notify the right people to rectify them before the end user ever noticed.
One thing I try to leave with each business unit I've transformed is how to innovate. This is how the business unit goes into high achiever mode: mentoring them in how to think differently about problems and the approaches they take in IT. Encouraging wild ducks, so to speak.
Transforming underperforming business units takes as much leadership as it takes technical prowess. I have found that the more prominent the problem (e.g. a complete application outage), the easier it was to troubleshoot and fix. Intermittent issues or glitches, such as response time occasionally going from 30ms to 560ms, tended to be more difficult because capturing data (let alone the right data) at the time of the problem can be difficult. But that only means more effort needs to be spent on the application monitoring tools in order to flush out the necessary data.
On Solaris on Sparc we're seeing a scenario of high CPU with the majority of threads doing work in similar thread stacks (see below), with the top of the stack sometimes in montReduce, squareToLen, multiplyToLen or subN. The scenario is that a number of new TLS connections are incoming and a bug identified as 8153189 causes high CPU. However, on the Solaris on Sparc platform it appears there is no fix available, even though there is a fix available for Solaris on x86 (it is not enabled by default; you have to use -XX parameters to enable the fix. See earlier link). I am still waiting on Java to confirm the fix status.
This particular scenario is playing out in the tiers between IHS and WAS. The workaround is to minimize the frequency of TLS handshakes by tuning the IHS configuration so that connections persist rather than being destroyed, and by reconfiguring WAS to allow unlimited requests per connection. See addendum at the end of this post:
"WebContainer : 123" daemon prio=3 tid=0x00123456 nid=0xfffa runnable [0x1234568a0]
Servers > Application servers > $SERVER > Web container settings > Web container transport chains > * > HTTP Inbound Channel > Select "Use persistent (keep-alive) connections" and "Unlimited persistent requests per connection" (and then restart the server)
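To confirm that most threads really are sitting in the same few methods, a thread dump can be tallied by the top frame of each stack. The sketch below assumes a HotSpot-style dump in which each thread starts with a quoted name and frames start with "at"; the sample dump is made up for illustration and only reuses method names mentioned in this post.

```python
import re
from collections import Counter

# Matches a stack frame line such as "    at pkg.Class.method(File.java)".
FRAME = re.compile(r"^\s*at\s+([\w.$<>]+)\(")


def top_frames(dump_text: str) -> Counter:
    """Count how many threads have each method at the top of their stack."""
    counts = Counter()
    awaiting_top = False
    for line in dump_text.splitlines():
        if line.startswith('"'):      # thread header line
            awaiting_top = True
            continue
        match = FRAME.match(line)
        if match and awaiting_top:    # first frame after the header
            counts[match.group(1)] += 1
            awaiting_top = False
    return counts


# Made-up sample dump, for illustration only.
sample_dump = "\n".join([
    '"WebContainer : 1" daemon prio=3 runnable',
    '   java.lang.Thread.State: RUNNABLE',
    '        at java.math.BigInteger.montReduce(BigInteger.java)',
    '        at java.math.BigInteger.oddModPow(BigInteger.java)',
    '',
    '"WebContainer : 2" daemon prio=3 runnable',
    '   java.lang.Thread.State: RUNNABLE',
    '        at java.math.BigInteger.squareToLen(BigInteger.java)',
    '        at java.math.BigInteger.oddModPow(BigInteger.java)',
    '',
    '"WebContainer : 3" daemon prio=3 runnable',
    '   java.lang.Thread.State: RUNNABLE',
    '        at java.math.BigInteger.montReduce(BigInteger.java)',
])

frame_counts = top_frames(sample_dump)
```

Comparing the tally across several dumps taken a few seconds apart shows whether the same hot methods persist.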
I don't normally get involved with User Interface (UI) performance but here is a good article that describes some tips and techniques to make your UI appear snappier. The few times I have been involved with UI performance issues I've used IBM Page Detailer which is a free download that also has some IBM Research papers and links around various performance improvements.
If you do any work around JavaServer Faces (JSF) you'll find IBM Page Detailer pretty handy.
A few years ago I blogged about how adding JSESSIONID logging to the access log helps identify which cluster member a user was pinned to. It turns out this also helps troubleshoot another interesting problem.
A WebSphere Application Server administrator noted that one of the JVMs in their cluster had a far higher session count than any other JVM in the cluster, roughly a 3:1 imbalance in the total number of sessions. We applied the JSESSIONID logging and captured all the session IDs. Through various Unix utilities (cut, sort, uniq, etc.) we ended up with a prime suspect. One session was calling the /login page 10-20 times per second and had eclipsed every other session by over 10x the number of requests.
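The cut/sort/uniq pipeline can also be sketched in a few lines of Python. This version assumes each access log line contains the session ID as `JSESSIONID=<value>`; the actual layout depends on how the access log format was configured, and the sample lines and session values below are invented for illustration.

```python
import re
from collections import Counter

# Assumed layout: the session ID appears as JSESSIONID=<value> in each line.
SESSION = re.compile(r'JSESSIONID=([^;\s"]+)')


def requests_per_session(lines) -> Counter:
    """Count requests per JSESSIONID across access log lines."""
    counts = Counter()
    for line in lines:
        match = SESSION.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Invented sample lines, for illustration only.
sample_log = [
    '10.0.0.1 "GET /login HTTP/1.1" 200 JSESSIONID=0000AAA:member1',
    '10.0.0.1 "GET /login HTTP/1.1" 200 JSESSIONID=0000AAA:member1',
    '10.0.0.2 "GET /home HTTP/1.1" 200 JSESSIONID=0000BBB:member2',
    '10.0.0.1 "GET /login HTTP/1.1" 200 JSESSIONID=0000AAA:member1',
    '10.0.0.3 "GET /home HTTP/1.1" 200',
]

session_counts = requests_per_session(sample_log)
busiest = session_counts.most_common(1)
```

A session that sits far above the rest in `most_common()` is the one to investigate first.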
Why did we go down this path? We were able to see through the PMI data that the session manager in WebSphere Application Server was invalidating sessions, so we knew it wasn't a product issue of sessions not being deleted. Also, one JVM in the cluster creating more sessions than the other JVMs is suspicious; with genuinely higher traffic I would have expected to see higher load across the whole cluster. In addition, they had seen the behaviour move around the cluster every few months. That led me to believe this was like a replay attack. Someone at some point captured a response with a JSESSIONID and was then using that JSESSIONID over and over again until some event caused it to capture a new JSESSIONID (most likely a failover event as the cluster went through a rolling restart). That behaviour was curious! The fact that it was smart enough to realize the HTTP header content had changed, and adapted, was interesting.
So next time you see one or more JVMs with considerably higher session counts than the other JVMs in the same cluster, you can use the same troubleshooting methodology to track down who the suspect is, especially if your application is Internet-facing, meaning anyone can start pinging your application.
Some interesting updates to licensing WebSphere Application Server (Express, Base, Network Deployment and Liberty Core) today.
Quoting the "At a Glance" from the link above:
The WebSphere Application Server Performance Cookbook has been published! I've hinted about this book in previous excerpt postings. Now you can read it in its entirety. Get ready for a long read. The book encapsulates the WAS/Java/Cloud performance knowledge of some of the smartest people in IBM.
A lot of applications use JMS for asynchronous messaging. And as many of you know, I really am into scalability, especially when working with high volume applications. There is a two-part article from one of my UK colleagues on a flexible and scalable WebSphere MQ topology pattern (Part 1 and Part 2), which contains code samples.
Another excerpt from our WebSphere Application Server Performance Cookbook, due for external publication sometime in the near future, on determining the health of a JVM. This may or may not look like the final publication.
"A common question is how does one determine how efficiently is the JVM performing and what metrics point to a JVM that is in, or heading toward, distress?
Once you have determined that the application is not healthy follow the appropriate MustGather and open a PMR with IBM Support."
I've been involved in some performance testing of various application deployment tools that happen to be open source. The one common theme I have run into is buggy software. Apparently open source tools are following the continuous delivery strategy and pushing out new builds frequently. However, testing seems to be sorely lacking. At least once a week some part of my deployment fails because of a recent update. Something that worked last week quit working this week.
In addition, dependencies appear to be tricky for the open source owner. The number one bug has been that the new version has somehow messed up the dependencies and refuses to run.
I would really like to see some comments on how other people manage these bugs. I would be pretty upset if my application deployment ground to a halt because of a buggy release.
I have had a number of conversations with various colleagues lately around "optimal performance." It kind of reminds me of an old post I had about predicting capacity and application performance for a non-existent application (hint: it can't be done).
One scenario we were discussing was an application that was currently in one of its testing cycles. And the question came up, well, could we have another team test the same application and see if they could get better performance or at least validate the performance we are seeing is consistent? That seemed an unusual request. It turns out after some questioning that one team was not confident in the people they had or the results they were seeing. My answer to that was they should pull in the right resources to help. Having a second performance team test the same application made little sense. Just standing up a test environment would have been a big effort not to mention getting floor space in the data center.
So how do you determine if the performance your application sees is optimal performance?
The answer is twofold. First, as in my previous post on when enough performance tuning is really enough, it is driven by business requirements (SLAs) and by having the appropriate application monitoring tools in place to verify the application is meeting those SLAs.
The second is a little greyer. While SLAs should hold a lot of the requirements, one that is commonly not in scope is capacity. It is difficult to predict capacity until performance testing has been completed. While the application may meet response time and transaction throughput requirements, we may discover that, especially in high volume environments, the hardware infrastructure capacity needed to support the application is going to cost a lot of money (or more money than was expected). Much like the conclusion in my previous article, in this case there may be a reason to continue performance tuning (including application development to reduce resource utilization) to try to reduce the infrastructure cost impact. However, at some point a decision will have to be made as to whether the cost of the performance tuning effort is exceeding the cost differential of the increased capacity needed to support the application.
The popular XC10 appliance has periodic firmware upgrades. However, a recent client was experiencing slower response times from the XC10 after a recent firmware upgrade. After much looking we saw in the AIX client and XC10 packet traces that the ACK from the AIX client was taking 200ms on almost every response from the XC10. This anomaly can be remedied by setting the following on AIX:
no -o tcp_nodelayack=1
The topic gets into interesting TCP/IP conversations about Nagle algorithms, MTU, piggybacking ACKs on data packets and timeouts that I was not aware of, but the AIX level TCP/IP configuration change resolved the problem. A similar setting exists for Linux (see techrepublic link below). Note: this can increase the number of packets on the network; in our testing after making the change we did not see that side effect. Even so, collect the necessary tcpdump/iptrace to trust and verify.
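One way to "trust and verify" is to measure the gap between each data packet from the appliance and the next pure ACK from the client, before and after the change. The sketch below works on already-parsed trace records of the form (time in ms, sender, kind); that record layout, the host labels and the sample values are assumptions for illustration, and real tcpdump/iptrace output would need to be parsed into this shape first. It also ignores ACKs piggybacked on data packets.

```python
def ack_delays_ms(packets, server="xc10", client="aix"):
    """Return the delay between each server data packet and the client's next ACK."""
    delays = []
    pending = None  # time of the oldest data packet not yet acknowledged
    for time_ms, sender, kind in packets:
        if sender == server and kind == "data":
            if pending is None:
                pending = time_ms
        elif sender == client and kind == "ack" and pending is not None:
            delays.append(time_ms - pending)
            pending = None
    return delays


# Hypothetical parsed traces, for illustration only.
before_change = [
    (0, "xc10", "data"), (200, "aix", "ack"),
    (210, "xc10", "data"), (410, "aix", "ack"),
]
after_change = [
    (0, "xc10", "data"), (1, "aix", "ack"),
    (10, "xc10", "data"), (11, "aix", "ack"),
]

delays_before = ack_delays_ms(before_change)
delays_after = ack_delays_ms(after_change)
```

Counting the ACK packets in each trace also shows whether the change noticeably increased packet volume.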
Edit Nov 20 to add some references and interesting related reading
It is being reported in the news that there is another OpenSSL bug around the SSL handshake that makes it possible to force the handshake to use a less secure encryption method. This is interesting because a couple of months back I was troubleshooting a problem in two supposedly identical prod/test environments where test was negotiating tlsv2 but prod was negotiating sslv3. We eventually solved that problem by changing the configuration of the prod environment to use only tlsv2. But I am curious how this was showing up in the prod environment. It may be time to circle back and take a closer look.
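The same idea, refusing to negotiate below a chosen protocol version, can be illustrated on the client side with Python's standard ssl module. This is only a sketch of the principle and is not how the prod environment above was configured; TLS 1.2 as the floor is my assumption, since the post does not give an exact version.

```python
import ssl


def strict_client_context() -> ssl.SSLContext:
    """Build a client context that will not negotiate anything older than TLS 1.2."""
    context = ssl.create_default_context()
    # Assumed floor: TLS 1.2. Older protocols such as SSLv3 are then refused.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


tls_context = strict_client_context()
```

With a floor set on both ends, a handshake cannot be pushed down to an older protocol without failing outright.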
A few days ago a few colleagues contacted me about my article on proactive application monitoring. They're building some templates for monitoring applications in the cloud and they had some questions, specifically around thresholds for many of the metrics I had listed. For example, one of the questions was around datasource connection pool utilization. Is it reasonable to set a warning threshold when the connection pool is 85% utilized and a critical threshold when it is 95% utilized? Likewise, for CPU utilization, would a warning at 75% and critical alerts at 90% be reasonable?
The answer is (drum roll, please): it depends.
No two applications are alike. There are low volume, rarely used applications that may never get above 2% connection pool utilization. Conversely, there are high volume applications where the connection pool can be running at 90-100% utilization. Better metrics to watch (via the PMI metrics) are (a) how many threads had to wait for a connection from the connection pool and (b) how long those threads had to wait. Both of those metrics directly impact the throughput and response time of the application.
The same goes for CPU utilization. Some organizations like to run their servers hot, at over 90% utilization, because they have spare, passive capacity they can bring online. Others like to run at less than 50% utilization because they want to have spare capacity in an active-active modus operandi.
Setting useful thresholds depends on understanding the organization's Service Level Agreements (SLAs) and the application's Non-Functional Requirements (NFRs).
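The connection pool advice above can be sketched as an alert rule that ignores raw utilization and looks only at whether threads waited and for how long. The field names and the 50ms/500ms thresholds are hypothetical; real values would come from the application's SLAs and NFRs.

```python
from dataclasses import dataclass


@dataclass
class PoolSample:
    percent_used: float    # pool utilization, kept for context only
    waiting_threads: int   # threads that had to wait for a connection
    avg_wait_ms: float     # how long those threads waited on average


def pool_alert(sample: PoolSample, warn_wait_ms: float = 50.0,
               crit_wait_ms: float = 500.0) -> str:
    """Alert on waiting, not on utilization. Thresholds are hypothetical."""
    if sample.waiting_threads == 0:
        return "ok"
    if sample.avg_wait_ms >= crit_wait_ms:
        return "critical"
    if sample.avg_wait_ms >= warn_wait_ms:
        return "warning"
    return "ok"


busy_but_healthy = pool_alert(PoolSample(98.0, 0, 0.0))
quiet_but_stalled = pool_alert(PoolSample(60.0, 12, 750.0))
```

A pool at 98% utilization with no waiters stays quiet, while a pool at 60% with long waits raises a critical alert, which matches the point that utilization alone says little about throughput or response time.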