WebSphere Performance - Alexandre Polozoff's Point of View
Resiliency and Circuit Breakers
In a recent review meeting about a problem with a high-volume application, many of the same questions that have been asked in the past came up again. How does one prevent one problem from cascading into separate, unrelated facets of the application? On my old blog I wrote about circuit breakers in the specific case of a loop gone haywire. There are other kinds of circuit breakers that can be placed in applications, and I have seen them work well. One I particularly like, and haven't really blogged about much, allows the operations folks to disable specific functions of an application. This is easily facilitated if the application is well designed (i.e. functions are easily identifiable by examining the HTTP request itself) or compartmentalized (i.e. separate functions are handled by separate logical clusters). For example, one cluster of servers handles only the "search" function, because we know that search tends to exhaust resources, while another handles the "checkout" function, which we want running 100% of the time so that every user who wants to can purchase the goods in their shopping cart.
The beauty of this setup is that if any specific function, as detected through the application monitoring infrastructure, is failing or causing an unexpected bottleneck, operations can quickly trip the circuit breaker and shunt any following requests to a "Sorry, not available" page. This ability is key for a couple of reasons. First and foremost, it addresses the fact that a failure of some sort is in progress: even though it hasn't been fixed, we can quickly move traffic to another path that at least gives the end user a response. This keeps additional requests from overwhelming the production environment and saves having to restart all the servers to clear things up. The other reason is that it allows for more sharing of the infrastructure, because the runbook has a plan we can follow to quickly alleviate the problem by simply turning off the spigot.
I have seen two different approaches to solving this problem. The first is in the infrastructure: if the functions of the application are easily identifiable or clustered independently, the operations team can modify either the load balancing rules or the HTTP plugin configuration. I particularly like this one because as soon as the operations team has identified a particular fault they can trip the appropriate circuit breaker and get started with the problem determination steps.
The other approach, which can be used in combination with the first, is to build circuit breaker checks into the application at various points in the code. Each check reads a bit from the database to decide whether to continue processing the current function. This is similar to the loop circuit breaker I referenced above: if we know our loops should never iterate more than 500 times, we have them abort and throw an exception on the 501st iteration. If the operations and development teams agree that some piece of functionality has broken, the bit can be flipped in the database and that function is either disabled and directed to an error page or, alternatively, returns some cached value (if possible; it depends on the kind of data the user was after). |
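To make the in-application approach from the circuit breaker post concrete, here is a minimal sketch in Java. Everything in it is an assumption for illustration: the class names, the function key "search" and the in-memory map that stands in for the bit in the database are all hypothetical, and a real application would read the flag from wherever operations can flip it.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Operator-controlled switch per application function (hypothetical names throughout). */
class FunctionCircuitBreaker {
    // Stand-in for the database bit; assumed here to be an in-memory map for the sketch.
    private final Map<String, Boolean> tripped = new ConcurrentHashMap<>();

    void trip(String function)  { tripped.put(function, true); }
    void reset(String function) { tripped.put(function, false); }

    boolean isTripped(String function) {
        return tripped.getOrDefault(function, false);
    }

    /** Runs the work unless the breaker is tripped; otherwise returns the fallback (error page, cached value). */
    <T> T call(String function, Supplier<T> work, Supplier<T> fallback) {
        return isTripped(function) ? fallback.get() : work.get();
    }
}

/** Loop guard: aborts once a loop runs past its expected maximum number of iterations. */
class LoopCircuitBreaker {
    private final int maxIterations;
    private int count;

    LoopCircuitBreaker(int maxIterations) { this.maxIterations = maxIterations; }

    /** Call once per iteration; throws on iteration maxIterations + 1. */
    void tick() {
        if (++count > maxIterations) {
            throw new IllegalStateException("loop exceeded " + maxIterations + " iterations");
        }
    }
}
```

With a limit of 500, the first 500 calls to tick() pass and the 501st throws. If the flag really does live in the database, caching it for a short interval is one way to avoid an extra read on every request, at the cost of a small delay before a tripped breaker takes effect.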
Maximo v7.5 report scheduler enhancements
Report scheduler enhancements in Maximo v7.5. As with any online transaction application, most enterprises need to pull reports from their environment. Reports tend to be (a) scheduled to repeat and (b) heavy users of CPU and memory. Having more control over the report scheduler is therefore worth a look in Maximo v7.5.
|
WebSphere Performance Web site link
I'm sure most of you have this link but just in case you do not... this is the link to the WebSphere Application Server Performance Web site. That should be your first stop for any documentation on performance and tuning WebSphere Application Server.
|
One huge frame or lots of smaller machines?
polozoff
Tags: machines, sla, small, peformance, planning, nfr, failover, frame, large, capacity
Yesterday at my "Performance Testing and Analysis" talk someone asked whether it is better to have one large machine and virtualize the environment or to have lots of small machines. Both strategies work. If you decide to go with large-capacity frames, you want at least 3 of them. That way, if one frame is taken out of service, at least two other frames are still running. If one builds out only 2 large frames and one is taken out of service, the remaining frame becomes a single point of failure. I don't like SPOFs and so would have at least 3. This assumes one frame can take the entire production load during the outage. Whether it can has to be culled from the performance testing: is there enough capacity in one frame to carry the entire production load? If not, additional frames will most likely be needed to ensure that even if half the infrastructure is taken out of service the remaining frames can carry the entire production workload. Likewise, having lots of smaller machines also works. A massive hardware outage is less likely with smaller machines, so as one fails it can be safely taken out of service and replaced without impacting the production workload. Which strategy is better? In my opinion both are valid as long as proper capacity planning is conducted to ensure that when an outage occurs (and think worst-case scenario here) the remaining infrastructure can continue processing the production workload without impacting the SLA (Service Level Agreement, including the non-functional requirements, e.g. response time, resource utilization, etc.). You do have a defined SLA, right?
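The arithmetic behind that capacity question can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class and parameter names are hypothetical, load and capacity are in whatever unit the performance tests report (requests per second, say), and the minimum number of surviving frames encodes the "no single point of failure during an outage" preference.

```java
/** Capacity arithmetic for the frame discussion; all inputs are hypothetical. */
class FrameCapacity {
    /**
     * Smallest number of frames such that, after framesLost frames are out of service,
     * the survivors can carry peakLoad and there are at least minSurvivors of them.
     */
    static int framesNeeded(double peakLoad, double capacityPerFrame, int framesLost, int minSurvivors) {
        if (peakLoad <= 0 || capacityPerFrame <= 0 || framesLost < 0 || minSurvivors < 1) {
            throw new IllegalArgumentException("inputs must be positive (framesLost may be zero)");
        }
        // Frames required purely to carry the load, rounded up to whole frames.
        int forLoad = (int) Math.ceil(peakLoad / capacityPerFrame);
        return Math.max(forLoad, minSurvivors) + framesLost;
    }
}
```

If one frame can carry the whole load, one frame may be lost and two survivors are wanted, the result is 3, which matches the "at least 3 frames" rule of thumb. If the measured peak is 2.5 times what a single frame can carry, the same call asks for 4 frames.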
|
lots of threads in socketRead
Someone takes a javacore during what looks to be a hung app server and notices it contains lots of threads in socketRead. This is symptomatic of a slow back end, whether it is a database, Web service, etc. An application is only as strong as its weakest link. If the back end the application depends on is unable to respond in a timely manner, there is nothing that can be tuned at the application layer except aggressive timeouts to protect the application from getting stuck. Hangs like these typically happen under high load/traffic conditions. It is important that the group that maintains the back end is made aware of the issue in their tier so that they can fix it.
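As an illustration of what an aggressive timeout looks like at the lowest level, here is a small sketch using the standard java.net.Socket API. The class name, method name and timeout values are hypothetical; a real application would more likely set the equivalent timeouts on its JDBC data source or Web service client than on a raw socket.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical client that refuses to wait indefinitely on a slow back end. */
class BackendClient {
    /**
     * Connects and reads a single byte. A back end that does not answer within
     * readTimeoutMs causes a java.net.SocketTimeoutException instead of a thread
     * parked in socketRead.
     */
    static int readFirstByte(String host, int port, int connectTimeoutMs, int readTimeoutMs)
            throws IOException {
        try (Socket socket = new Socket()) {
            // Bound the time spent establishing the connection.
            socket.connect(new InetSocketAddress(host, port), connectTimeoutMs);
            // Bound the time any single read may block.
            socket.setSoTimeout(readTimeoutMs);
            InputStream in = socket.getInputStream();
            return in.read();
        }
    }
}
```

The caller decides what to do with the timeout exception: return an error, serve a cached value, or retry later. Either way the thread is released rather than left waiting on the back end.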
|
Updating the JDBC driver
At the WTC conference in Berlin this week I held the "Top 10 tuning recommendations for WebSphere Application Server" talk. One of the questions asked w.r.t. updating the JDBC driver was whether it is okay to just update the jar file, or whether one needs to delete and recreate the JDBC provider. It is fine to just update the jar file and restart the JVM.
|
Application Monitoring Identifying and resolving problems in the cloud with IBM SmartCloud Monitoring
polozoff
Tags: cause, analysis, application, troubleshooting, cloud, patterns, root, smartcloud, monitoring, pureapplication
IBM's new application monitoring solution
During the Q&A session of a customer performance presentation, one of the questions was about performance when migrating from a bare metal environment to a virtualized cloud environment. This is a really good question! As I've said many times, when it comes to performance you cannot manage what you cannot measure. The cloud doesn't escape the rules of Computer Science and Performance 101. This is where IBM's SmartCloud Monitoring - Application Insight can help. The features page describes how this new application monitoring solution can provide insight into application performance and user experience. It provides capabilities such as dynamic monitoring of cloud applications for a variety of public and private cloud providers, including Amazon EC2, VMware and IBM's PureApplication System (IPAS), along with diagnostic drill-downs and embedded monitoring technology to aid troubleshooting and root cause analysis. The resource page provides articles on topics such as service management and proactive application management. The community page provides a wiki on best practices, a forum for technical questions, a blog for the latest news and updates, and a community to engage with IBM experts or peers at other enterprise organizations. Finally, documentation and an extensive knowledge base can be found on the SmartCloud support page.
Why is this important? Primarily because clouds are composed of many VM instances. If not properly configured or tracked, the underlying resources the VMs need (e.g. CPU, RAM, network, disk) can easily be overcommitted, resulting in perplexing performance problems. Inside the VM everything may look nominal, but without visibility into the cloud itself troubleshooting is nearly impossible, prolonging the poor performance and the bad end user experience. No one wants to prolong a bad end user experience. In addition, production monitoring data can be fed back to the cloud capacity planning organization, which can use it in their calculation models to maintain Service Level Agreements (SLAs) and availability requirements. Cloud infrastructure, while seemingly boundless, can run short of resources as applications grow and mature if no one is monitoring the environment.
From an administration and infrastructure perspective, cloud technologies are presenting new and exciting ways to further simplify those tasks. Last year when I was working on the IPAS performance team I was really enthused by patterns and the powerful capabilities behind them, especially where consistency and repeatability are a must. Even more impressive is the symbiotic integration of various IBM technologies, such as in the mobile space, where Worklight is developing IPAS support and mobile application platform patterns.
[Edit to correct typo and added tags and title and a couple of links to more specific content] |
Java native out of memory errors
As applications grow over time they tend to add features and functions, and then one day they run out of native heap as more and more Java classes are piled in.
Edit: Oct 30, 2015 and Nov 17, 2015. The native OOM (NOOM) landscape continues to shift. An argument that offsets the heap to a different area in the address space (-Xgc:preferredHeapBase) is a much better option. With this argument one can place the Java heap past the initial 4g of address space, allowing native code to use almost all of the lower 4g. |
Software complexity, faults, remediation from a NASA perspective
I'm always on the lookout for interesting reading on the topic of software complexity and failure. Through serendipity I came across this fascinating document from NASA [ http://www.nasa.gov/pdf/418878main_FSWC_Final_Report.pdf ] on just this topic. I think one could easily remove the word "flight" from this document and see immediate applicability to one's own enterprise environment. What interesting sites have you come across on these topics?
|
Prepare for problems
I am writing this blog entry to remind people that data collection should occur in all phases of an application's life cycle in production. This includes collecting data like javacores and heapdumps in production when the application is running in a nominal, steady state. That baseline is valuable when problems do occur, because the data captured during the bad behaviour can be compared with the data captured when things were fine. The data also feeds trend analysis, in conjunction with application monitoring tools, for understanding what users are doing under both conditions. Collect data often and be prepared for the day when all of a sudden the production environment is exhibiting distress.
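One lightweight way to record a baseline from inside the JVM is sketched below. To be clear about the assumptions: this is not a javacore or a heapdump, the class name is hypothetical, and it only uses the standard java.lang.management API to count threads by state. It is meant to show the idea of capturing a "healthy" snapshot that can later be compared with one taken during an incident.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical helper that summarizes the current thread states of this JVM. */
class ThreadBaseline {
    /** Returns the number of live threads in each state at the moment of the call. */
    static Map<Thread.State, Integer> snapshot() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        // No monitor or synchronizer details are requested, which keeps the dump cheap.
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            if (info != null) {
                counts.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Logging the result at regular intervals gives a steady-state picture; a sudden jump in BLOCKED or WAITING threads compared with the baseline is the kind of difference the post suggests looking for.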
|
WebSphere Application Server v8.5 redbook
If you're looking at the v8.5 release of WebSphere Application Server you will want to check out this redbook.
|
Underscores not allowed in host names
As part of a troubleshooting exercise we uncovered what appears to be a not commonly known limitation on host names.
"Avoid using the underscore (_) character in machine names. Internet standards dictate that domain names conform to the host name requirements described in Internet Official Protocol Standards RFC 952 and RFC 1123. Domain names must contain only letters (upper or lower case) and digits. Domain names can also contain dash characters ( - ) as long as the dashes are not on the ends of the name. Underscore characters ( _ ) are not supported in the host name. If you have installed WebSphere Application Server on a machine with an underscore character in the machine name, access the machine with its IP address until you rename the machine."
|
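The host name rule quoted in the underscore post above can be checked mechanically before a machine is named. The following is a sketch rather than a complete RFC validator: the class name is hypothetical, and the length limits (63 characters per label, 253 overall) come from general DNS rules and are not part of the quoted text.

```java
import java.util.regex.Pattern;

/** Hypothetical checker for the letters, digits and interior-dash rule. */
class HostNames {
    // One label: letters and digits, dashes only in the interior, at most 63 characters.
    private static final Pattern LABEL =
            Pattern.compile("[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?");

    /** Returns true if every dot-separated label satisfies the rule; a trailing dot is rejected. */
    static boolean isValid(String hostName) {
        if (hostName == null || hostName.isEmpty() || hostName.length() > 253) {
            return false;
        }
        // The -1 limit keeps empty labels so that "a..b" and "a." are rejected.
        for (String label : hostName.split("\\.", -1)) {
            if (!LABEL.matcher(label).matches()) {
                return false;
            }
        }
        return true;
    }
}
```

A name such as app-server1.example.com passes, while app_server1, -web and web- are rejected.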
Investigating BigInteger.oddModPow
polozoff
Tags: multiplytolen, high, sparc, solaris, subn, montreduce, muladd, squaretolen, biginteger, cpu, hang
On Solaris on SPARC we're seeing a scenario of high CPU with the majority of threads doing work in similar thread stacks (see below), with the top of the stack sometimes in montReduce, squareToLen, multiplyToLen or subN. The scenario is evidently that a number of new TLS connections are incoming and a bug identified as 8153189 causes high CPU. However, on the Solaris on SPARC platform it appears there is no fix available, even though there is a fix available for Solaris on x86 (it is not enabled by default; you have to use -XX parameters to enable it. See earlier link). I am still waiting on Java to confirm the fix status.
This particular scenario is playing out in the tiers between IHS and WAS. The workaround is to minimize the frequency of TLS handshakes by maximizing the IHS settings so that connections persist and are not destroyed, and by reconfiguring WAS to allow unlimited requests per connection. See addendum at the end of this post:
"WebContainer : 123" daemon prio=3 tid=0x00123456 nid=0xfffa runnable [0x1234568a0]
IHS Configs: ThreadLimit 25
WAS Configs: Servers > Application servers > $SERVER > Web container settings > Web container transport chains > * > HTTP Inbound Channel > Select "Use persistent (keep-alive) connections" and "Unlimited persistent requests per connection" (and then restart the server)
|
SpecJ results are available
See this link for the SpecJ performance results that are available. |
Blast from the past
Back in 2009 I wrote a blog post on socketRead issues causing hung threads and on using timeouts when communicating with a database. These problems don't happen very often anymore, as networks and databases tend to run in fairly robust environments. You can imagine my surprise when late last week I received an email from a colleague working with an application suffering from the same symptoms described in that post. The reason for setting the timeouts is to be able to fail fast, as opposed to the application appearing to be non-responsive. However, what to do if the timeout doesn't seem to be kicking in? In this case the first thing to do is to open a PMR with IBM Support. Data really needs to be collected in at least two places: on the application server and on the database. For the database, http://www.ibm.com/developerworks/data/library/techarticle/dm-0812wang/ provides information on how to use db2top to collect data while the problem is occurring. For the application server (this happens to be for BPM), http://www-01.ibm.com/support/docview.wss?uid=swg21611603 lists various sets of mustgathers to collect for IBM Support to analyze. Networks can also have hiccups, and by running tcpdump on both the application server and database side one can use protocol inspection tools like Wireshark to look at the underlying network communications and see what, if any, problems may be occurring there. |