Some interesting updates to licensing WebSphere Application Server (Express, Base, Network Deployment and Liberty Core) today.
Quoting the "At a Glance" from the link above:
polozoff 110000N2A2 3,645 Views
Application Monitoring Identifying and resolving problems in the cloud with IBM SmartCloud Monitoring
polozoff 110000N2A2 Tags:  cause analysis application cloud troubleshooting patterns root monitoring smartcloud pureapplication 5,260 Views
IBM's new application monitoring solution
During the Q&A session of a customer performance presentation, one of the questions was about performance when migrating from a bare metal environment to a virtualized cloud environment. This is a really good question! As I've said many times, when it comes to performance you cannot manage what you cannot measure. The cloud doesn't escape the rules of Computer Science and Performance 101.
This is where IBM's SmartCloud Monitoring - Application Insight can help. The features page describes how this new application monitoring solution can provide insight into application performance and user experience. It provides capabilities such as dynamic monitoring of cloud applications for a variety of public and private cloud providers, including Amazon EC2, VMware and IBM's PureApplication System (IPAS), along with diagnostic drill-downs and embedded monitoring technology to aid troubleshooting and root cause analysis. The resource page provides articles on topics such as service management and proactive application management. The community page offers a wiki on best practices, a forum for technical questions, a blog for the latest news and updates, and a community for engaging with IBM experts or peers at other enterprise organizations. Finally, documentation and an extensive knowledge base can be found on the SmartCloud support page.
Why is this important? Primarily because clouds are composed of many VM instances. If they are not properly configured or tracked, the underlying resources the VMs need (e.g. CPU, RAM, network, disk) can easily be overcommitted, resulting in perplexing performance problems. Inside the VM everything may look nominal, but without visibility into the cloud itself troubleshooting is nearly impossible, prolonging the poor performance and the bad end user experience. No one wants to prolong a bad end user experience.
In addition, production monitoring data can be fed back to the cloud capacity planning organization. This allows production data to be used in their calculation models to maintain Service Level Agreements (SLAs) and availability requirements. Cloud infrastructure, while seemingly boundless, can run short of resources as applications grow and mature if no one is monitoring the environment.
From an administration and infrastructure perspective, cloud technologies are presenting new and exciting ways to further simplify those tasks. Last year when I was working on the IPAS performance team I was really enthused by patterns and the powerful capabilities behind them, especially where consistency and repeatability are a must. Even more impressive is the symbiotic integration of various IBM technologies, as in the mobile space, where Worklight is developing IPAS support and mobile application platform patterns.
[Edit to correct typo and added tags and title and a couple of links to more specific content]
I wrote about socketRead issues causing hung threads, and about using timeouts when communicating with a database, in a blog post back in 2009. These problems don't happen very often anymore, as networks and databases tend to run on fairly robust environments. You can imagine my surprise when late last week I received an email from a colleague working with an application suffering from the same symptoms described in that post.
The reason for setting the timeouts is to be able to fail fast as opposed to the application appearing to be non-responsive.
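As a rough illustration of failing fast, the sketch below uses plain Python sockets rather than JDBC (an assumption made to keep the example self-contained; the function name `read_with_timeout` is hypothetical). The read gives up after a fixed timeout instead of leaving the calling thread hung in a socket read.

```python
import socket


def read_with_timeout(host, port, timeout_seconds):
    """Read from a peer, returning None if it stays silent past the timeout.

    Hypothetical helper: the timeout passed to create_connection applies to
    the connect and to every later read on the same socket.
    """
    with socket.create_connection((host, port), timeout=timeout_seconds) as conn:
        try:
            return conn.recv(1024)
        except socket.timeout:
            # Fail fast: report the silent peer instead of blocking forever.
            return None
```

Without the timeout, `recv` would block for as long as the peer stays silent, which is the non-responsive behavior the timeout is meant to avoid.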
However, what to do if the timeout doesn't seem to be kicking in? In this case the first thing to do is to open a PMR with IBM Support.
Data really needs to be collected in at least two places: on the application server and on the database.
For the database http://www.ibm.com/developerworks/data/library/techarticle/dm-0812wang/ provides information on how to use db2top to collect data when the problem is occurring.
Then for the application server (this happens to be for BPM), http://www-01.ibm.com/support/docview.wss?uid=swg21611603 lists various sets of mustgathers to collect so that IBM Support can run its analysis.
Networks can also have hiccups. By running tcpdump on both the application server and the database side, one can use protocol inspection tools like Wireshark to look at the underlying network communications and see what, if any, problems are occurring there.
I have had a number of conversations with various colleagues lately around "optimal performance." It kind of reminds me of an old post I had about predicting capacity and application performance for a non-existent application (hint: it can't be done).
One scenario we were discussing was an application that was currently in one of its testing cycles. And the question came up, well, could we have another team test the same application and see if they could get better performance or at least validate the performance we are seeing is consistent? That seemed an unusual request. It turns out after some questioning that one team was not confident in the people they had or the results they were seeing. My answer to that was they should pull in the right resources to help. Having a second performance team test the same application made little sense. Just standing up a test environment would have been a big effort not to mention getting floor space in the data center.
So how do you determine if the performance your application sees is optimal performance?
The answer is twofold. First, as in my previous post on when enough performance tuning is really enough, it is driven by business requirements (SLAs) and by having the appropriate application monitoring tools in place to verify the application is meeting those SLAs.
The second is a little greyer. While SLAs should hold a lot of the requirements, one that is commonly not in scope is capacity. It is difficult to predict capacity until performance testing has been completed. While the application may meet response time and transaction throughput requirements, we may discover that, especially in high volume environments, the hardware infrastructure capacity needed to support the application is going to cost a lot of money (or more money than was expected). As I concluded in my previous article, in this case there may be a reason to continue performance tuning (including application development to reduce resource utilization) to try to reduce the infrastructure cost. However, at some point a decision will have to be made as to whether the cost of the performance tuning effort exceeds the cost differential of the increased capacity needed to support the application.
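The trade-off in that last decision can be written down as a simple comparison. This is only a sketch: the function name and all three inputs are hypothetical estimates an organization would have to supply itself, expressed in the same currency.

```python
def worth_continuing_tuning(tuning_cost, capacity_cost_now, capacity_cost_after_tuning):
    """Return True while the expected infrastructure savings exceed the tuning cost.

    All inputs are assumed estimates, not measured values.
    """
    expected_savings = capacity_cost_now - capacity_cost_after_tuning
    return expected_savings > tuning_cost
```

For example, if further tuning is expected to cut the capacity bill from 400,000 to 300,000, it pays for itself only while the tuning effort costs less than 100,000.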
Perennially I stumble into the question of when is the performance tuning effort finished? How do you know if enough has been done or there is still further to go?
Personally, I think this should be dictated specifically around Service Level Agreements (SLAs). If the throughput / response time is within the SLAs then we are done tuning. Any more tuning should only be done if the application (in a new release) starts slipping SLAs or the SLAs themselves have changed.
But not everyone has defined or agreed upon SLAs, which muddies the waters. Say, for example, we're building a new application. We have a number of components that we're performance testing but we don't know if the results we are seeing are the best results that can be achieved. Well, that makes sense. If we have no experience working with a particular end point (e.g. database, messaging engine) then obviously our lack of knowledge of that package or of our application code will hamper the analysis.
This is where application monitoring comes in. Properly installed and configured, the application monitoring tool should be able to provide enough information as to whether we have the best throughput and response time. Look at the metrics and compare CPU and memory utilization. Make sure all the resources under load are as close to maxed out as possible without introducing contention. Look at the throughput / response time graphs as load is ramped up and then at how the application behaves after reaching steady state. Once all these have come together you are probably ready to go to the next step and oversaturate the environment. Once you've proven that oversaturation has occurred (because of the noted degradation) then you can probably tune no better. And it will probably take a lot of tuning work to get to the point where you are able to saturate all the resources. This will be through a variety of methods like increased thread pools, vertical/horizontal scaling, increased memory settings, garbage collection policy analysis, changing application code, etc.
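One way to spot the noted degradation in load test results is to look for the load level where throughput peaked before falling off. The sketch below assumes the monitoring tool can export (load, throughput) pairs; the function name and the sample format are hypothetical.

```python
def find_saturation_point(samples):
    """Locate the load at which throughput peaked before degrading.

    samples: list of (load, throughput) pairs ordered by increasing load.
    Returns the load at the throughput peak, provided the final
    (highest-load) sample is below that peak; otherwise returns None
    because no degradation was observed.
    """
    peak_load, peak_throughput = samples[0]
    for load, throughput in samples[1:]:
        if throughput > peak_throughput:
            peak_load, peak_throughput = load, throughput
    last_throughput = samples[-1][1]
    return peak_load if last_throughput < peak_throughput else None
```

Real measurements are noisy, so in practice a tolerance around the peak would be needed before declaring that saturation has occurred.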
However, is it worth it? Does an organization really want to pay for the most optimal tuning when it may not be necessary? These efforts can be long and drawn out (i.e. months instead of weeks). They can also tie up a lot of highly technically skilled people on an effort that provides some unknown possible improvement. Let's look at the only scenario I can think of (can you think of another?).
High volume applications tend to also be resource hogs requiring gobs (i.e. hundreds or thousands) of JVMs and the supporting infrastructure. This can get expensive real quick, especially when all the administration, configuration and maintenance costs are factored in. In this one scenario I can see justification, even beyond defined SLAs, to have the best performance possible in order to reduce the capacity needed to support the application.
I am often posed with the philosophical question
"Is a complete production outage a performance problem?"
From a performance perspective it sure is!
Today's blog post is about a performance aspect most folks in IT miss completely: backup and restore.
"Wait a minute here Mr. Polozoff! We take regular backups!"
"Ah, but have you ever actually taken that backup and restored it to another server?" is my typical response. And you can probably guess the answer I get and it is rarely positive. That is the point of today's post. I was reminded of this from a recent discussion with a colleague that was conducting a review of a client environment. Backup is a regular activity in just about every IT shop but what is not a regular activity is actually seeing if the backup can be restored to another machine. I actually saw the devastating effect this had at one of the clients I was supposed to engage with. On my second day I arrived at the office to be met by the manager I was working with. He had bad news. The production database failed and the backup failed to restore. In fact, none of the backups they had could restore the database. Since I was there to look at some problems in production that effectively ended that engagement because I couldn't study the application server environment if it had no database to connect to.
That is a hard way to learn an important lesson. No matter how often you take a backup it needs to be regularly tested and restored to another server ensuring the backup is what we think it is and not just a jumble of bits that ultimately has zero value.
In a recent review meeting on a problem with a high volume application, many of the same questions that have been asked in the past were brought up. How does one prevent one problem from cascading into separate, unrelated facets of the application? On my old blog I spoke about circuit breakers in the specific case of a loop gone haywire. There are other kinds of circuit breakers that can be placed in applications that I have seen work well.
One that I tend to like and haven't really blogged about much allows the operations folks to disable specific functions of an application. This is easily facilitated if the application is well designed (i.e. functions are easily identifiable by examining the HTTP request itself) or is compartmentalized (i.e. separate functions are handled by separate logical clusters). For example, one cluster of servers handles only the "search" functions, because we know that search tends to exhaust resources, while the "checkout" function, which we want running 100% of the time so that every user who wants to can purchase the goods in their shopping cart, runs on another. The beauty of this setup is that if any specific function, as detected through the application monitoring infrastructure, is experiencing a failure or causing an unexpected bottleneck, the operations team can quickly trip the circuit breaker and shunt any following requests to a "Sorry, not available" page.
This type of circuit breaker is key for a couple of reasons. First and foremost, it addresses the fact that a failure of some sort is in progress: even though it hasn't been fixed, we can quickly move traffic to another path that at least gives the end user a response. This avoids additional requests overwhelming the production environment and having to restart all the servers to clear things up. The other reason is that it allows for more sharing of the infrastructure, because we have a plan to follow in our runbook where we can quickly alleviate the problem by simply turning off the spigot.
I have seen two different approaches to solving this problem. In the case of the infrastructure if the functions of the application are easily identifiable or clustered independently then the operations team can easily modify either load balancing rules or make changes to the HTTP plugin configuration. I particularly like this one because as soon as the operations team has identified a particular fault they can trip the appropriate circuit breaker and get started with the problem determination steps.
Another approach, which can be used in combination with the previous solution, is to build circuit breaker checks into the application at various points in the code. Each check reads a bit from the database to see whether the current function should continue processing or not. This is similar to the loop circuit breaker I referenced above: if we know our loops should never iterate more than 500 times, we have them abort and throw an exception on the 501st iteration. If there is a consensus among the operations and development teams that some piece of functionality has broken, a bit can be flipped in the database and that function is either disabled and directs to an error page or, alternatively, returns some cached value (if possible; it depends on the kind of data the user was going after).
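A minimal sketch of both in-application breakers follows. Everything here is illustrative: the names are hypothetical, and the set `disabled_functions` stands in for the bits that would really be read from the database.

```python
class CircuitOpenError(Exception):
    """Raised when a loop runs past its expected iteration limit."""


MAX_ITERATIONS = 500  # the loop limit used in the text


def guarded_iterations(items, limit=MAX_ITERATIONS):
    """Yield items, aborting on the iteration after `limit` (the 501st by default)."""
    for count, item in enumerate(items, start=1):
        if count > limit:
            raise CircuitOpenError(f"loop exceeded {limit} iterations")
        yield item


def handle_request(function_name, disabled_functions, handler, cached_value=None):
    """Run `handler` unless the bit for this function has been flipped.

    `disabled_functions` is a stand-in for flags stored in the database.
    """
    if function_name in disabled_functions:
        # Breaker tripped: serve a cached value if one exists, else an error page.
        return cached_value if cached_value is not None else "Sorry, not available"
    return handler()
```

With `disabled_functions = {"search"}`, search requests get the error page (or a cached result) while checkout requests still reach their handler.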
Happy Thanksgiving to everyone. I hope everyone was able to get a good meal and time with family today.
This week I'm writing to you from Seoul, South Korea (it is actually Friday the day AFTER Thanksgiving here yet the Macy's Thanksgiving parade I am watching via Slingbox is still on). I'm working with some colleagues here and doing some mentoring and skills transfer to help broaden the problem determination skills within IBM. Which brings me to today's topic. We encountered a classic application hang. Sometimes, but not all the time, the administrator would restart the application on WAS v8.5 and when the test team started to apply load to the application it would hang. Javacores from kill -3 showed all threads stuck in createOrWaitForConnection. Now for those of you who do follow my blog you probably know about the various techniques I've posted to debug this situation. As we had no access to the developers it was up to us to try and figure out what was causing the hang. Various random twiddling of various AIX OS level parameters didn't work (random changes never do). If they waited long enough the application would sometimes recover and start processing again.
After watching the testing go on for a while I finally suggested we increase the connection pool maximum size to 2n+1 where n = thread pool maximum. The team had set the connection pool maximum equal to the thread pool maximum. There was some disbelief that we should go down this path. Any good administrator knows that we want classic funneling, where thread pool max is larger than connection pool max, to make optimal use of memory, CPU, etc. They re-ran the test and after the 5th attempt realized that we could not recreate the hang. I've posted this command before:
netstat -an |grep ESTA |grep <port#> |wc -l
which gives a connection count to the database on port#. It may be double the value (showing source and destination connections) so you may have to divide the value in half. In our case with thread pool max at 50 and connection pool max set to 101 we were capturing as many as 90 established connections to the database at any one time. Obviously the developers of the application were following the anti-pattern of opening a second connection to the database before closing the first connection which resulted in the deadlock our team in Seoul was observing.
So why wasn't this deadlocking with each and every test? That comes down to randomness. While load tests may follow a set process and scripts, there is some variability between each test. It may not vary widely test after test, but the variability exists in terms of timing on the server. There can be various processes running, or not, at any given point in time. Load on the CPU or tasks the OS is doing can subtly change that timing, inducing variability. Timing is key, and in some cases the test team got lucky and the test would work. Other times the timing was off and the application would deadlock. This particular anti-pattern is very sensitive to timing. Get the wrong timing and the application will deadlock, and hard.
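Whether unlucky timing can deadlock the pool at all comes down to a worst-case count. The sketch below assumes each thread needs at most `connections_per_thread` connections, keeps the ones it has while waiting, and never times out; the function name is hypothetical.

```python
def deadlock_possible(thread_pool_max, connection_pool_max, connections_per_thread=2):
    """Check whether the worst-case interleaving can exhaust the pool.

    Worst case: every thread holds all but one of the connections it needs
    and waits for the last one. That can only happen if those partial
    holdings are enough to use up the whole pool.
    """
    return connection_pool_max <= thread_pool_max * (connections_per_thread - 1)
```

With thread pool max at 50, a connection pool max of 50 allows the deadlock (`True`), while 101 does not (`False`). A `True` result only says the deadlock can happen, not that it will on any given run.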
In addition, when they would wait a while the application would recover. This is because underneath the cover of WAS it is quietly reclaiming connections because it knows how long threads have been holding open connections. Once a threshold (timeout) is reached WAS begins the active process of reclaiming connections that have been opened too long. This results in free connections being returned to the pool and the threads that were stuck in createOrWaitForConnection can resume processing.
What is the lesson learned here? When load testing an unknown application it might be worth setting connection pool max to 2n+1 of the thread pool max just to start with, and using the netstat command (or your application monitoring tools) to see how many connections the application attempts to use. Then, once experience is gained with the application, reduce the size of the connection pool to something more reasonable based on the observed high water marks in the connection pool utilization. This is a much easier tactic than trying to debug an application that is deadlocked in createOrWaitForConnection.
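The arithmetic behind this tactic is small enough to sketch. The helper names are hypothetical, and `netstat_counts` is assumed to be a list of connection counts sampled with the netstat command during the load test.

```python
def initial_connection_pool_max(thread_pool_max):
    """Starting point from the text: 2n+1, where n is the thread pool maximum."""
    return 2 * thread_pool_max + 1


def high_water_mark(netstat_counts, counts_both_ends=False):
    """Highest established-connection count seen across the samples.

    Set counts_both_ends=True when netstat listed both source and
    destination entries for each connection, so each raw count is halved.
    """
    divisor = 2 if counts_both_ends else 1
    return max(count // divisor for count in netstat_counts)
```

For a thread pool max of 50 this gives a starting pool max of 101, and samples of 62, 90 and 75 established connections give a high water mark of 90 to size the pool against later.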
polozoff 110000N2A2 Tags:  machines sla small peformance planning nfr failover frame large capacity 5,653 Views
Yesterday at my "Performance Testing and Analysis" talk a question was asked: is it better to have one large machine and virtualize the environment, or to have lots of small machines?
Both strategies work. If deciding to go with large capacity frames then you want to have at least 3 frames. This way if one frame is taken out of service there are at least two other frames running. Otherwise, if one builds out only 2 large frames and one is taken out of service then the remaining frame becomes a single point of failure. I don't like SPOFs and so would have at least 3. This is if one frame can take the entire production load during the outage. That information has to be culled from the performance testing to see if there is enough capacity in one frame to carry the entire production load. If not then more likely than not there will need to be additional frames to be able to ensure that even if half the infrastructure is taken out of service the remaining frames can carry the entire production workload.
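The failover check described above can be sketched as a single comparison. The function name and parameters are hypothetical; the per-frame capacity and the peak load are assumed to come from performance testing and to share one unit (for example requests per second).

```python
def survives_outage(frame_count, frame_capacity, peak_load, frames_out=1):
    """True if the frames left running can carry the whole production load.

    frame_capacity and peak_load must be in the same unit.
    """
    remaining = frame_count - frames_out
    return remaining > 0 and remaining * frame_capacity >= peak_load
```

With a peak load of 1800 and frames that each handle 1000, three frames survive the loss of one, but two frames do not, so more frames (or larger ones) would be needed.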
Likewise, having lots of smaller machines also works. Odds are less likely for a massive hardware outage with smaller machines so as one fails it can be safely taken out of service and replaced without impacting the production workload.
Which strategy is better? In my opinion they are both valid strategies as long as the proper capacity planning is conducted to ensure that when an outage occurs (and think worst case scenario here) the remaining infrastructure is able to continue processing the production workload without impacting the SLA (Service Level Agreement, including the non-functional requirements [e.g. response time, resource utilization]). You do have a defined SLA, right?
At the WTC conference in Berlin this week I presented the "Top 10 tuning recommendations for WebSphere Application Server". One of the questions asked about updating the JDBC driver was whether it was okay to just update the jar file or whether they needed to delete and recreate the JDBC provider.
It is fine to just update the jar file and restart the JVM.