WebSphere Peformance - Alexandre Polozoff's Point of View
On Solaris this tuning page has the details that I often follow for Option 1.
In a recent review meeting on a problem with a high volume application many of the same questions that have been asked in the past were brought up. How does one prevent one problem from cascading into separate, unrelated facets of the application. On my old blog I spoke about circuit breakers in the specific case of a loop gone haywire. There are other kinds of circuit breakers that can be placed in applications that I have seen and proven work well.
One of the ones I tend to like and haven't really blogged about much allow the operations folks to disable specific functions of an application. This is easily facilitated if the application is well designed (i.e. functions are easily identifiable by examining the HTTP request itself) or is compartmentalized (i.e. separate functions are handled by separate logical clusters) where one cluster of servers only handles the "search" functions because we know that search will tend to exhaust resources vs the "checkout" function which we want to run 100% of the time so that every user that wants to can purchase the goods in their shopping cart. The beauty of this set up is that if any specific function, as detected through the application monitoring infrastructure, is experiencing a failure or is causing an unexpected bottleneck can quickly trip the circuit breaker and shunt any following requests to a "Sorry, not available" page.
The ability of this type of circuit breaker is key for a couple of reasons. First and foremost it addresses the fact that a failure of some sort is in progress and even though it hasn't been fixed we can quickly move traffic to another path that at least gives the end user a response. This avoid additional requests from overwhelming the production environment and having to restart all the servers to clear things up. The other reason is that it also allows for more sharing of the infrastructure because we have a plan to follow in the our runbook where we can quickly alleviate the problem by simply turning off the spigot.
I have seen two different approaches to solving this problem. In the case of the infrastructure if the functions of the application are easily identifiable or clustered independently then the operations team can easily modify either load balancing rules or make changes to the HTTP plugin configuration. I particularly like this one because as soon as the operations team has identified a particular fault they can trip the appropriate circuit breaker and get started with the problem determination steps.
Another approach, which can be used in combination with the previous solution, is to actually build into the application circuit breaker checks at various points in the code. This would then cause a read from the database to check on a bit in the environment to see if it should continue processing the current function or not. Similar to the loop circuit breaker I referenced above where if we know our loops should never iterate more than 500 times to have them abort and throw an exception on the 501st iteration. If there is a consensus among the operations and development teams that some piece of functionality has broken and bit can be flipped in the database and that function is either disabled and directs to an error page or can alternatively provide back some cached value (if possible, it depends on the kind of data the user was going after).
Application Monitoring Identifying and resolving problems in the cloud with IBM SmartCloud Monitoring
polozoff 110000N2A2 Tags:  cause analysis application cloud troubleshooting patterns root monitoring smartcloud pureapplication 2,575 Views
IBM's new application monitoring solution
During the Q&A session of a customer performance presentation one of the questions asked about performance in migrating from a bare metal environment to a virtualized cloud environment. This is a really good question! As I've said many times when it comes to performance you can not manage what you can not measure. The cloud doesn't escape the rules of Computer Science and Performance 101.
This is where IBM's SmartCloud Monitoring - Application Insight can help. The features page describes how this new application monitoring solution can provide insight into application performance and user experience. Providing capabilities such as dynamic monitoring of cloud applications for a variety of public and private cloud providers including Amazon EC2, VMware and IBM's PureApplication System (IPAS). Diagnostic drill downs and embed monitoring technology to aid troubleshooting and root cause analysis. The resource page provides articles on topics such as service management and proactive application management. A wiki on best practices, a forum for technical questions, a blog for the latest news and updates and a community to engage with IBM experts or peers at other enterprise organizations is all provided from the community page. Finally, documentation and an extensive knowledge base can be found on the SmartCloud support page.
Why is this important? Primarily because clouds are comprised of many VM instances. If not properly configured or tracked the underlying resources the VMs need (i.e. CPU, RAM, network, disk, etc) can easily be over committed resulting in perplexing performance problems. Inside the VM everything may look nominal but without visibility into the cloud itself troubleshooting is near impossible prolonging the negative performance and end user experience. No one wants to prolong a bad end user experience.
In addition, production monitoring data can be fed back into the cloud capacity planning organization. This allows for production data to be used in their calculation models to maintain Service Level Agreements (SLAs) and availability requirements. Cloud infrastructure, while seemingly boundless, can suffer from resource availability as applications grow and mature if no one is monitoring the environment.
From an administration and infrastructure perspective cloud technologies are presenting new and exciting technologies to further simplify those tasks. Last year when I was working on the IPAS performance team I was really enthused by patterns and the powerful capabilities behind them especially where consistency and repeatability is a must. Even more impressive is the symbiotic integration of various IBM technologies like in the mobile space with Worklight developing their IPAS support and mobile application platform patterns.
[Edit to correct typo and added tags and title and a couple of links to more specific content]