Modified by polozoff
The popular XC10 appliance has periodic firmware upgrades. However, a recent client was experiencing slower response times from the XC10 running with a recent firmware upgrade. After much looking we saw in the AIX client and XC10 packet traces that the ACK from the AIX client was taking 200ms on almost every response from the XC10. This anomaly can be remedied by setting on AIX
no -o tcp_nodelayack=1
The topic leads into interesting TCP/IP conversations about the Nagle algorithm, MTU, piggybacking ACKs on data packets, and timeouts that I was not aware of, but the AIX-level TCP/IP configuration change resolved the problem. A similar setting exists for Linux (see the techrepublic link below). Note: this can increase the number of packets on the network; however, in our testing we did not see that side effect after making the change. Still, collect the necessary tcpdump/iptrace data to trust and verify.
Edit Nov 20 to add some references and interesting related reading
For some reason I had problems downloading the Liberty Profile 9 beta getting 404 not found responses. This link sent to me by a colleague worked. Not sure why but if you're having trouble downloading the beta follow the link.
As applications grow over time they tend to add features and functions and then one day they run out of native heap as more and more Java classes are piled in.
I am promoting this link, written by one of my colleagues, that covers how to use -Xmcrs and set it to 200M or higher. The fact that I have seen this twice in the past month tells me this is a growing (pun intended) problem.
Edit: Oct 30, 2015 and Nov 17, 2015
The native OOM (NOOM) landscape continues to shift. A much better approach is an argument that offsets the heap to a different area of the address space (-Xgc:preferredHeapBase). With this new argument, one can place the Java heap past the initial 4g of address space, allowing all native code to use almost all of the lower 4g of space.
Correction: I previously wrote that the Xmcrs option is not used anymore because of this, but Xmcrs is still used. Here is the latest technote that supersedes the link I provided previously. More on troubleshooting out of memory errors.
As part of a troubleshooting exercise we uncovered what appears to be a not commonly known limitation on host names.
"Avoid using the underscore (_) character in machine names. Internet standards dictate that domain names conform to the host name requirements described in Internet Official Protocol Standards RFC 952 and RFC 1123. Domain names must contain only letters (upper or lower case) and digits. Domain names can also contain dash characters ( - ) as long as the dashes are not on the ends of the name. Underscore characters ( _ ) are not supported in the host name. If you have installed WebSphere Application Server on a machine with an underscore character in the machine name, access the machine with its IP address until you rename the machine."
In over 3 decades of IT one of the consistent themes of my job has been transforming under performing business units for various clients. Some of the scenarios I've been called in to transform have been
- retailers suffering from unplanned outages causing revenue loss and potential loss of customer loyalty for their brand
- customer service centers having users calling the call center because the application is not performing at "market speed"
- financial institutions or health organizations having to shut down services due to users seeing other users' data
Those are just three examples of under performing scenarios, each of which affects the end user's experience. If those users are not locked into the brand they could potentially leave, go to a competitor and never come back. The underlying theme in the scenarios needs to be addressed by someone with an extensive operational and application development background in information technology. Here are some of the strategies I have used to transform under performers to over achievers.
One of the first things I learned in my career was taking charge. This is not as easy as it sounds. First, it meant having extreme confidence in myself and my decisions. In the early days sometimes I was successful but other times I was not and needed to step back and reconsider what the next steps had to be. What follows are some of my lessons learned with taking charge.
Under performing business units tend to have difficulty making decisions. Quite simply, there are too many ways to do any single technical task or effort that solves a problem. When an organization runs by consensus it is even harder to make a decision. I learned early in my career that giving any organization a choice, even one as simple as effortA versus effortB, was futile. I have spent countless hours in meetings over the risks and ramifications of the two, or more, choices. I found that providing one solution and only one solution is the most expeditious way to move an organization forward, all the while keeping plan B in the back of my mind in case my first decision hits a road block. However, at this point I have had enough experiences and previous failures to pretty much nail what plan A needs to be and how to execute it.
Have a plan and articulate it
A plan means nothing if the plan makes no sense to anyone else. When I provide the direction to move an organization forward it comes with a step by step plan that addresses
- short term, immediate tactical steps, risks and goals addressing the must haves for the current problem(s)
- intermediate and longer term approach to the would like to have goals
Even the short term, immediate tactical steps may have several iterations of different efforts that can span weeks or months depending on how severe the problem is. Inevitably, regardless of scope, the plan requires actions in both operations and application development. Sometimes I got lucky and it was only one or the other, but not often.
Start with the basics. Have things like the recommended OS level tunings been applied? If not then that is the first part of the plan.
It should go without saying that any plan should be testable outside of the production environment. However, as applications mature and the user base grows so does the operational IT environment. Test as much as possible and keep production changes down to one change per change window with a tested back out plan. Which brings us to the next topic.
Repeatability (AKA scripting)
To minimize risk, a robust operational IT infrastructure requires the ability to perform tasks over and over again while understanding exactly what the resulting output should be. Whether it is configuring a configuration item or deploying an application, we should clearly understand the end result. Scripts developed and tested in test that work can be promoted to production. I'll note here that with the advent of DevOps this facet of IT operations has become significantly easier and more robust than in the past. In some of the testing I manage internally for very large scale performance testing of 10,000 Liberty servers in a SoftLayer environment, I know that our Gradle scripts will build out the environment from scratch the same way each and every time.
Change is a scary word for a lot of people because it also means risk to the business. This is why change processes should be followed meticulously. Having redundancy and lots of it also reduces risk during a change. If redundancy doesn't exist then it needs to be the first part of the transformation plan.
Operational, infrastructural changes versus application fixes
I have always kept operational changes and application fixes out of the same change window. How quickly application fixes are introduced ultimately depends on how quickly they can be identified, coded and tested. Sometimes it is an easy code fix but other times whole architectures or designs need to be re-worked due to poor decision making. The same level of complexity can exist in the IT infrastructure, slowing down the speed with which change can be made because developing scripts or testing can take a lot of time. And testing the back out plan can take longer than testing the solution.
Move the organization to be proactive
Under performing organizations typically react to problems. That means that the problem(s) may have been impacting the end user's experience for some time. Application monitoring is key to helping an organization respond proactively to problems. I once led a business unit from being penalized every quarter for server uptime not meeting SLAs to collecting a bonus every quarter for exceeding the SLAs, all by installing and configuring the right tools to allow the organization to identify problems and notify the right people to rectify the problem before the end user ever noticed.
One thing I try to leave with each business unit I've transformed is how to innovate. This is how the business unit goes into high achiever mode: mentoring them in how to think differently about problems and the approaches they take in IT. Encouraging wild ducks, so to speak.
Transforming under performing business units takes as much leadership as it takes technical prowess. I have found that the more prominent the problem (e.g. complete application outage) the easier it was to troubleshoot and fix. Intermittent issues or glitches, like response time going from 30ms to 560ms every once in a while, tended to be more difficult as capturing data (let alone the right data) at the time of the problem can be difficult. But that only means more effort needs to be spent on the application monitoring tools in order to flush out the necessary data.
I'm working on a Liberty server (this is the latest beta I downloaded a couple of days ago) and using the installUtility I'm getting the following error.
# bin/installUtility install adminCenter-1.0
Establishing a connection to the configured repositories...
This process might take several minutes to complete.
CWWKF1219E: The IBM WebSphere Liberty Repository cannot be reached. Verify that your computer has network access and firewalls are configured correctly, then try the action again. If the connection still fails, the repository server might be temporarily unavailable.
I then found out about a command to help try and figure out what is wrong
bin]# ./installUtility find --type=addon --verbose=debug
[6/25/15 10:57:53:093 CDT] Establishing a connection to the configured repositories...
This process might take several minutes to complete.
[6/25/15 10:57:53:125 CDT] Failed to connect to the configured repository:
IBM WebSphere Liberty Repository
[6/25/15 10:57:53:125 CDT] Reason: The connection to the default repository failed with the
following exception: RepositoryBackendIOException: Failed to read
[6/25/15 10:57:53:128 CDT] com.ibm.ws.massive.RepositoryBackendIOException: Failed to read properties file https://public.dhe.ibm.com/ibmdl/export/pub/software/websphere/wasdev/downloads/assetservicelocation.props
Caused by: java.net.SocketException: java.lang.ClassNotFoundException: Cannot find the specified class com.ibm.websphere.ssl.protocol.SSLSocketFactory
Will update when I have more details on why I'm getting the ClassNotFoundException.
and that resolves the issue. A defect has been raised to have the script use the Java we supply instead of the machine's.
[Edited Aug 25 to add
I also needed to update /etc/host.conf to enable hosts file lookup and then add entries for
to /etc/hosts file
The Aug 2015 beta seems to have made a number of fixes to installUtility so if you're on an older beta get the latest.]
A few years ago I blogged about how adding JSESSIONID logging to the access log helps identify which cluster member a user was pinned to. It turns out this also helps troubleshoot another interesting problem.
A WebSphere Application Server administrator noted that the session count on one of their JVMs in the cluster was getting far higher session counts than any other JVM in the cluster. So much so it was like a 3:1 imbalance in total number of sessions in the JVM. We applied the JSESSIONID logging and captured all the session ids. Through various Unix utilities (cut, sort, uniq, etc) we ended up with a prime suspect. One session was calling the /login page 10-20 times per second and had eclipsed every other session by over 10x the number of requests.
Why did we go down this path? We were able to see through the PMI data that the session manager in WebSphere Application Server was invalidating sessions, so we knew it wasn't an issue of the product not deleting sessions. Also, one JVM in the cluster creating more sessions than the other JVMs is suspicious; I would have expected to see higher load across the whole cluster. In addition, they had seen the behaviour move around the cluster every few months. That led me to believe this was like a replay attack. Someone at some point captured a response with a JSESSIONID and was then using that JSESSIONID over and over again until some event caused it to capture a new JSESSIONID (most likely a failover event as the cluster went through a rolling restart). That behaviour was curious! The fact it was smart enough to realize the HTTP header content had changed and adapt was interesting.
So next time you see one or more JVMs with considerably higher session counts than the other JVMs in the same cluster you can use the same troubleshooting methodology to track down who the suspect is. Especially if your application is Internet-facing meaning anyone can start pinging your application.
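The cut/sort/uniq step can also be done in a few lines of script. The sketch below is only an illustration: it assumes each access log line carries the session id as "JSESSIONID=<value>", which depends on how the access log format was customized, and the "10 times the median" rule is an arbitrary starting point rather than anything we formally used.

```python
import re
from collections import Counter

# Assumption: the session id appears in each log line as JSESSIONID=<value>.
_SESSION = re.compile(r"JSESSIONID=([^\s;\"]+)")


def requests_per_session(lines):
    """Count requests per session id, similar to `cut | sort | uniq -c`."""
    counts = Counter()
    for line in lines:
        match = _SESSION.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def suspects(counts, factor=10):
    """Return session ids with at least `factor` times the median request count."""
    if not counts:
        return []
    ordered = sorted(counts.values())
    median = ordered[len(ordered) // 2]  # upper median when the count is even
    return [sid for sid, n in counts.most_common() if n >= factor * median]


if __name__ == "__main__":
    # Made-up log lines, only to show the shape of the input.
    sample = (
        ["GET /login HTTP/1.1 200 JSESSIONID=0000AAA:node1"] * 30
        + ["GET /home HTTP/1.1 200 JSESSIONID=0000BBB:node1"] * 2
        + ["GET /cart HTTP/1.1 200 JSESSIONID=0000CCC:node1"] * 3
    )
    print(suspects(requests_per_session(sample)))
```

Any session id this flags is only a candidate; the request paths and rate for that session still need to be reviewed by hand.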
See this link for the SpecJ performance results that are available.
The WebSphere Application Server Performance Cookbook has been published! I've hinted about this book in previous excerpt postings. Now you can read it in its entirety. Get ready for a long read. The book encapsulates the WAS/Java/Cloud performance knowledge of some of the smartest people in IBM.
A few days ago a few colleagues contacted me about my article on proactive application monitoring. They're building some templates for monitoring applications in the cloud and they had some questions specifically around thresholds for many of the metrics I had listed. For example, one of the questions was around datasource connection pool utilization. Is it reasonable to set thresholds for warnings if the connection pool was 85% utilized and critical if it was 95% utilized? Likewise, similar questions around CPU utilization and would a warning at 75% and critical alerts at 90% be reasonable?
The answer is, (drum roll please) it depends.
No two applications are alike. There are low volume, rarely used applications that may never get above 2% connection pool utilization. Conversely, there are high volume applications where the connection pool can be running at 90-100% utilization. Better metrics to watch (via the PMI metrics) are (a) how many threads had to wait for a connection from the connection pool and (b) how long those threads had to wait. Both of those metrics directly impact the throughput and response time of the application.
Same with CPU utilization. Some organizations like to run their servers hot over 90% utilization because they have spare, passive capacity they can bring online. Others like to run at less than 50% utilization because they want to have spare capacity in an active-active modus operandi.
Setting useful thresholds depends on understanding the organization's Service Level Agreements (SLAs) and the application's Non Functional Requirements (NFRs).
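To make the point concrete, here is a rough sketch of alerting on connection wait behaviour instead of raw pool utilization. Everything in it is hypothetical: the PoolSample fields stand in for whatever your monitoring tool exposes from PMI, and the wait limits must come from the application's own SLAs and NFRs.

```python
from dataclasses import dataclass


@dataclass
class PoolSample:
    """One monitoring interval for a datasource connection pool (hypothetical shape)."""
    waiting_threads: int  # threads that had to wait for a connection
    avg_wait_ms: float    # average time those threads waited


def pool_alert(sample: PoolSample, warn_wait_ms: float, critical_wait_ms: float) -> str:
    """Classify a sample using wait time limits derived from the SLA, not utilization."""
    if sample.waiting_threads == 0:
        return "ok"  # a pool at 100% utilization with no waiters is not hurting anyone
    if sample.avg_wait_ms >= critical_wait_ms:
        return "critical"
    if sample.avg_wait_ms >= warn_wait_ms:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Example limits only; a different application would pick different values.
    print(pool_alert(PoolSample(waiting_threads=12, avg_wait_ms=40.0), 25.0, 100.0))
```

The same pattern applies to CPU: the alert limits are inputs chosen per organization, not constants baked into a template.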
Here is another excerpt from our performance cookbook that will be published in the near future.
Excessive Direct Byte Buffers
Excessive native memory usage by java.nio.DirectByteBuffers is a classic problem with any generational garbage collector such as gencon (which is the default starting in IBM Java 6.26/WAS 8), particularly on 64-bit. DirectByteBuffers (DBBs) (http://docs.oracle.com/javase/6/docs/api/java/nio/ByteBuffer.html) are Java objects that allocate and free native memory. DBBs use a PhantomReference which is essentially a more flexible finalizer and they allow the native memory of the DBB to be freed once there are no longer any live Java references. Finalizers and their ilk are generally not recommended because their cleanup time by the garbage collector is non-deterministic.
This type of problem is particularly bad with generational collectors because the whole purpose of a generational collector is to minimize the collection of the tenured space (ideally never needing to collect it). If a DBB is tenured, because the size of the Java object is very small, it puts little pressure on the tenured heap. Even if the DBB is ready to be garbage collected, the PhantomReference can only become ready during a tenured collection. Here is a description of this problem (which also talks about native classloader objects, but the principle is the same):
If an application relies heavily on short-lived class loaders, and nursery collections can keep up with any other allocated objects, then tenure collections might not happen very frequently. This means that the number of classes and class loaders will continue increasing, which can increase the pressure on native memory... A similar issue can arise with reference objects (for example, subclasses of java.lang.ref.Reference) and objects with finalize() methods. If one of these objects survives long enough to be moved into tenure space before becoming unreachable, it could be a long time before a tenure collection runs and "realizes" that the object is dead. This can become a problem if these objects are holding on to large or scarce native resources. We've dubbed this an "iceberg" object: it takes up a small amount of Java heap, but below the surface lurks a large native resource invisible to the garbage collector. As with real icebergs, the best tactic is to steer clear of the problem wherever possible. Even with one of the other GC policies, there is no guarantee that a finalizable object will be detected as unreachable and have its finalizer run in a timely fashion. If scarce resources are being managed, manually releasing them wherever possible is always the best strategy. (http://www.ibm.com/developerworks/websphere/techjournal/1106_bailey/1106_bailey.html)
Essentially the problem boils down to either:
1. There are too many DBBs being allocated (or they are too large), and/or
2. The DBBs are not being cleared up quickly enough.
It is very important to verify that the volume and rate of DBB allocations are expected or optimal. If you would like to determine who is allocating DBBs (problem #1), of what size, and when, you can run a DirectByteBuffer trace. Test the overhead of this trace in a test environment before running in production.
One common cause of excessive DBB allocations is the default WAS WebContainer channelwritetype value of async. In this mode, all writes to servlet response OutputStreams (e.g. static file downloads from the application or servlet/JSP responses) are sent to the network asynchronously. If the network and/or the end-user do not keep up with the rate of network writes, the response bytes are buffered in DBB native memory without limit. Even if the network and end-user do keep up, this behavior may simply create a large volume of DBBs that can build up in the tenured area. You may change channelwritetype to sync to avoid this behavior although servlet performance may suffer, particularly for end-users on WANs.
If you would like to clear up DBBs more often (problem #2), there are a few options:
- Specifying -XX:MaxDirectMemorySize=$bytes will force the DBB code to run System.gc() when the sum of outstanding DBB native memory would be more than $bytes. This option may have performance implications. When using this option with IBM Java, ensure that -Xdisableexplicitgc is not used. The optimal value of $bytes should be determined through testing. The larger the value, the more infrequent the System.gc calls will be but the longer each tenured collection will be. For example, start with -XX:MaxDirectMemorySize=1024m and gather throughput, response time, and verbosegc garbage collection overhead numbers and compare to a baseline. Double and halve this value and determine which direction is better and then do a binary search for the optimal value.
- Explicitly call System.gc. This is generally not recommended. When DBB native memory is freed, the resident process size may not be reduced immediately because small allocations may go onto a malloc free list rather than back to the operating system. So while you may not see an immediate drop in RSS, the free blocks of memory would be available for future allocations so it could help to "stall" the problem. For example, Java Surgery can inject a call to System.gc into a running process: https://www.ibm.com/developerworks/community/groups/service/html/communityview?communityUuid=7d3dc078-131f-404c-8b4d-68b3b9ddd07a
In most cases, something like -XX:MaxDirectMemorySize=1024m (and ensuring -Xdisableexplicitgc is not set) is a reasonable solution to the problem.
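The double/halve/binary-search tuning procedure described above might be sketched as follows. This is one possible reading of that procedure, not a prescribed method: `measure` is a hypothetical callback that would run a load test with -XX:MaxDirectMemorySize=<size>m and return a single score where higher is better (for example throughput adjusted for GC overhead), and the stand-in scoring function in the demo is invented.

```python
def tune_by_double_halve(measure, start_mb=1024, rounds=4):
    """Search for a good size in MB by doubling/halving, then bisecting.

    `measure(size_mb)` is a hypothetical callback returning a score
    (higher is better) for a test run at that size.
    """
    scores = {start_mb: measure(start_mb)}
    for neighbor in (start_mb // 2, start_mb * 2):
        scores[neighbor] = measure(neighbor)

    best = max(scores, key=scores.get)
    if best == start_mb:
        return start_mb  # neither direction improved on the baseline

    # Bisect between the baseline and the better neighbor.
    low, high = sorted((start_mb, best))
    for _ in range(rounds):
        mid = (low + high) // 2
        if mid in scores:
            break  # the bracket cannot be narrowed any further
        scores[mid] = measure(mid)
        # Replace the worse endpoint so the bracket closes in on the better region.
        if scores[low] < scores[high]:
            low = mid
        else:
            high = mid
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Stand-in for a real load test: pretends the best size is near 1700 MB.
    print(tune_by_double_halve(lambda mb: -abs(mb - 1700)))
```

In practice each call to `measure` is a full test run compared against a baseline, so the number of rounds is limited by how much test time is available.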
A system dump or HPROF dump may be loaded in the IBM Memory Analyzer Tool, and the IBM Extensions for Memory Analyzer DirectByteBuffer plugin may be run to show how much of the DBB native memory is available for garbage collection. For example:
=> Sum DirectByteBuffer capacity available for GC: 1875748912 (1.74 GB)
=> Sum DirectByteBuffer capacity not available for GC: 72416640 (69.06 MB)
There is an experimental technique called Java surgery which uses the Java Late Attach API (http://docs.oracle.com/javase/6/docs/technotes/guides/attach/index.html) to inject a JAR into a running process and then execute various diagnostics: https://www.ibm.com/developerworks/community/groups/service/html/communityview?communityUuid=7d3dc078-131f-404c-8b4d-68b3b9ddd07a
This was designed initially for Windows because it does not usually have a simple way of requesting a thread dump like `kill -3` on Linux. Java Surgery has an option with IBM Java to run the com.ibm.jvm.Dump.JavaDump() API to request a thread dump (Oracle Java does not have an equivalent API, although Java Surgery does generally work on Oracle Java):
$ java -jar surgery.jar -pid 16715 -command JavaDump
Another excerpt from our WebSphere Application Server Performance Cookbook, due for external publication sometime in the near future, on determining the health of a JVM. This may or may not look like the final publication.
"A common question is how does one determine how efficiently is the JVM performing and what metrics point to a JVM that is in, or heading toward, distress?
Depending on the environment, number of JVMs, redundancy, continuous availability and/or high availability requirements the threshold for %CPU utilization varies. For HA/CA, business critical environments the threshold can be as low as 50% CPU utilization. For non-critical applications the threshold could be as high as 95%. One needs to analyze both the NFRs and SLAs of the application in order to determine appropriate thresholds to indicate a potential health issue with the JVM.
Amount of time spent in GC
This metric, gleaned from the verbose GC or PMI metrics, is a general indicator of how efficiently the application is utilizing memory and how quickly the garbage collector can complete its tasks. The more time spent in GC the more CPU the application will use and potentially impact the application response time. A general rule of thumb is time spent in GC below 8% is generally a marker of a healthy application environment. If the time spent in GC goes over 8% then it is probably time to either try and tune the JVM or start looking at capacity planning to grow the environment.
%heap utilization after a full GC
The low water mark after a full GC provides an indication of whether the heap is able to reclaim memory or not. If the low water mark continues to rise over time after a full GC then the application could be the victim of a memory leak. Heap dumps should be able to identify the culprit so the application can be corrected to eliminate the leak. Unfortunately, if the application cannot be fixed the only way to recover from a memory leak is through a controlled restart of the JVM. In a clustered environment this is not generally a problem if the JVM's users can be quiesced to another JVM before the restart; otherwise in-flight transactions will be affected when the JVM is stopped abruptly.
Application response time
Deteriorating (i.e. increasing) response time is often an indication of poor health.
Once you have determined that the application is not healthy follow the appropriate MustGather and open a PMR with IBM Support."
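Two of the checks in this excerpt reduce to simple arithmetic. The sketch below assumes the GC pause durations and the heap percentages after each full GC have already been extracted from verbose GC or PMI data; the function names and the five-sample window are my own choices for illustration.

```python
def gc_overhead_percent(pause_ms, interval_ms):
    """Percentage of a wall-clock interval spent in GC pauses."""
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    return 100.0 * sum(pause_ms) / interval_ms


def gc_health(pause_ms, interval_ms, threshold_percent=8.0):
    """Apply the 8% rule of thumb; exactly 8% is treated as worth investigating."""
    overhead = gc_overhead_percent(pause_ms, interval_ms)
    return "healthy" if overhead < threshold_percent else "investigate"


def low_water_mark_rising(heap_after_full_gc, window=5):
    """True if the last `window` post-full-GC heap readings each rose over the previous one."""
    recent = list(heap_after_full_gc)[-window:]
    if len(recent) < window:
        return False  # not enough full GCs yet to call it a trend
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


if __name__ == "__main__":
    # Made-up numbers: three pauses in a 60 second interval, five full GC readings.
    print(gc_health([100, 200, 300], 60_000))
    print(low_water_mark_rising([40, 42, 45, 47, 52]))
```

A strictly rising low water mark is only a hint of a leak; a heap dump is still needed to confirm it and find the culprit.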
I don't normally get involved with User Interface (UI) performance but here is a good article that describes some tips and techniques to make your UI appear snappier. The few times I have been involved with UI performance issues I've used IBM Page Detailer which is a free download that also has some IBM Research papers and links around various performance improvements.
If you do any work around Java Server Faces (JSF) you'll find IBM Page Detailer pretty handy.
I've been involved in some performance testing of various application deployment tools that happen to be open source. The one common theme I have run into is buggy software. Apparently open source tools are following the continuous delivery strategy, pushing out new builds frequently. However, testing seems to be sorely lacking. At least once a week some part of my deployment fails because of a recent update. Something that worked last week quit working this week.
In addition, dependencies appear to be tricky for the open source owner. The number one bug has been that the new version has somehow messed up the dependencies and refuses to run.
I would really like to see some comments on how other people manage these bugs. I would be pretty upset if my application deployment ground to a halt because of a buggy release.
It is being reported in the news that there is another OpenSSL bug around the SSL handshake that makes it possible to force the handshake to use a less secure encryption method. This is interesting because a couple of months back I was troubleshooting a problem in two supposedly identical prod/test environments where test was negotiating tlsv2 but prod was negotiating sslv3. We eventually solved that problem by changing the configuration of the prod environment to use only tlsv2. But I am curious how this was showing up in the prod environment. It may be time to circle back and take a closer look.
Fix pack 1 for WebSphere Application Server v8.5.5 was released earlier today.
A lot of applications use JMS for asynchronous messaging. And as many of you know I really am into scalability especially when working with high volume applications. There is a two part article from one of my UK colleagues for a flexible and scalable WebSphere MQ topology pattern Part 1 and Part 2 which contains code samples.
Troubleshooting a problem? If you are following the mustgathers then here is a link to a quick reference guide on how to set up trace in WebSphere Application Server.
For as long as I can remember the most debated Java topic has been whether the minimum heap size should equal the maximum, with lots of urban myths and legends claiming that having them equal was better. In a conversation with a number of colleagues and Chris Bailey, who has led the Java platform for many years, he clarified the settings for the IBM JVM based on generational vs non-generational policy settings.
"The guidance [for generational garbage collection policy] is that you should fix the nursery size: -Xmns == -Xmnx, and allow the tenured heap to vary: -Xmos != -Xmox. For non generational you only have a tenured heap, so -Xms != -Xmx applies.
The reason being that the ability to expand the heap adds resilience into the system to avoid OutOfMemoryErrors. If you're then worried about the potential cost of expansion/shrinkage that this introduces by causing compactions, then that can be mitigated by adjusting -Xmaxf and -Xminf to make expand/shrink a rare event."
A link to Chris Bailey's presentation on generational garbage collection http://www.slideshare.net/cnbailey/tuning-ibms-generational-gc-14062096
[edit to correct typo, added tags]
Some interesting updates to licensing WebSphere Application Server (Express, Base, Network Deployment and Liberty Core) today.
Quoting the "At a Glance" from the link above:
WebSphere® Application Server now offers Fixed Term License options:
Provides additional flexibility for projects that have a limited duration
Gives the option to renew the term license or discontinue use of the software at the end of the fixed term
Includes Software Subscription and Support for the period of the license
WebSphere Application Server V8.5.5 license terms are updated to relax server load balancing and failover restrictions
IBM's new application monitoring solution
During the Q&A session of a customer performance presentation one of the questions asked about performance in migrating from a bare metal environment to a virtualized cloud environment. This is a really good question! As I've said many times when it comes to performance you can not manage what you can not measure. The cloud doesn't escape the rules of Computer Science and Performance 101.
This is where IBM's SmartCloud Monitoring - Application Insight can help. The features page describes how this new application monitoring solution can provide insight into application performance and user experience. It provides capabilities such as dynamic monitoring of cloud applications for a variety of public and private cloud providers including Amazon EC2, VMware and IBM's PureApplication System (IPAS), as well as diagnostic drill-downs and embedded monitoring technology to aid troubleshooting and root cause analysis. The resource page provides articles on topics such as service management and proactive application management. A wiki on best practices, a forum for technical questions, a blog for the latest news and updates and a community to engage with IBM experts or peers at other enterprise organizations are all provided from the community page. Finally, documentation and an extensive knowledge base can be found on the SmartCloud support page.
Why is this important? Primarily because clouds are composed of many VM instances. If not properly configured or tracked, the underlying resources the VMs need (i.e. CPU, RAM, network, disk, etc) can easily be over committed, resulting in perplexing performance problems. Inside the VM everything may look nominal, but without visibility into the cloud itself troubleshooting is near impossible, prolonging the negative performance and end user experience. No one wants to prolong a bad end user experience.
In addition, production monitoring data can be fed back into the cloud capacity planning organization. This allows for production data to be used in their calculation models to maintain Service Level Agreements (SLAs) and availability requirements. Cloud infrastructure, while seemingly boundless, can suffer from resource availability as applications grow and mature if no one is monitoring the environment.
From an administration and infrastructure perspective cloud technologies are presenting new and exciting technologies to further simplify those tasks. Last year when I was working on the IPAS performance team I was really enthused by patterns and the powerful capabilities behind them especially where consistency and repeatability is a must. Even more impressive is the symbiotic integration of various IBM technologies like in the mobile space with Worklight developing their IPAS support and mobile application platform patterns.
[Edit to correct typo and added tags and title and a couple of links to more specific content]
Back in 2009 I wrote a blog post on socketRead issues causing hung threads and on using timeouts when communicating with a database. These problems don't happen very often anymore as networks and databases tend to run on fairly robust environments. You can imagine my surprise when late last week I received an email from a colleague working with an application suffering from the same symptoms described in that post.
The reason for setting the timeouts is to be able to fail fast as opposed to the application appearing to be non-responsive.
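As a rough illustration of failing fast, here is a small sketch using plain Python sockets. It is only an analogy for the database timeouts discussed in the earlier post, not the actual application server or JDBC configuration. The `read_with_timeout` helper and the local stand-in server are assumptions made for the example.

```python
import socket
import time


def read_with_timeout(host, port, timeout_s):
    """Connect and wait for data, giving up after timeout_s seconds."""
    with socket.create_connection((host, port), timeout=timeout_s) as conn:
        try:
            return conn.recv(1024)
        except socket.timeout:
            # Fail fast: report the problem instead of hanging the thread.
            return None


# Stand-in for an unresponsive back end: it listens but never replies.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

start = time.monotonic()
result = read_with_timeout("127.0.0.1", port, timeout_s=0.5)
elapsed = time.monotonic() - start
server.close()

print(f"result={result!r} after {elapsed:.2f}s")
```

Without the timeout the `recv` call would block indefinitely, which is what a hung thread stuck in a socket read looks like from the application's point of view.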
But what should you do if the timeout doesn't seem to be kicking in? In this case the first thing to do is to open a PMR with IBM Support.
Data really needs to be collected in at least two places: on the application server and on the database.
For the database, http://www.ibm.com/developerworks/data/library/techarticle/dm-0812wang/ provides information on how to use db2top to collect data while the problem is occurring.
Then for the application server (this one happens to be for BPM), http://www-01.ibm.com/support/docview.wss?uid=swg21611603 lists the various MustGather sets to collect so that IBM Support can run its analysis.
Networks can also have hiccups. By running tcpdump on both the application server and the database side, one can use protocol inspection tools like Wireshark to look at the underlying network communications and see what problems, if any, may be occurring there.
I have had a number of conversations with various colleagues lately around "optimal performance." It kind of reminds me of an old post I had about predicting capacity and application performance for a non-existent application (hint: it can't be done).
One scenario we were discussing was an application that was currently in one of its testing cycles. And the question came up, well, could we have another team test the same application and see if they could get better performance or at least validate the performance we are seeing is consistent? That seemed an unusual request. It turns out after some questioning that one team was not confident in the people they had or the results they were seeing. My answer to that was they should pull in the right resources to help. Having a second performance team test the same application made little sense. Just standing up a test environment would have been a big effort not to mention getting floor space in the data center.
So how do you determine if the performance your application sees is optimal performance?
The answer is twofold. First, as in my previous post on when enough performance tuning is really enough, it is driven by business requirements (SLAs) and by having the appropriate application monitoring tools in place to verify the application is meeting those SLAs.
The second is a little greyer. While SLAs should hold a lot of the requirements, one that is commonly not in scope is capacity. It is difficult to predict capacity until performance testing has been completed. While the application may meet response time and transaction throughput requirements, we may discover that, especially in high volume environments, the hardware infrastructure capacity needed to support the application is going to cost a lot of money (or more money than was expected). Much like the conclusion in my previous article, in this case there may be a reason to continue performance tuning (including application development to reduce resource utilization) to try to reduce the infrastructure cost. However, at some point a decision will have to be made as to whether the cost of the performance tuning effort is exceeding the cost differential of the increased capacity needed to support the application.
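That trade-off is just arithmetic, and a small sketch makes it explicit. The function name and every figure below are made-up placeholders. Real numbers would come from your own capacity planning and project cost estimates.

```python
def tuning_pays_off(tuning_cost, capacity_cost_untuned, capacity_cost_tuned):
    """Return (pays_off, net_savings) for a proposed tuning effort.

    net_savings is the capacity cost avoided minus what the tuning costs.
    """
    capacity_savings = capacity_cost_untuned - capacity_cost_tuned
    net_savings = capacity_savings - tuning_cost
    return net_savings > 0, net_savings


# Hypothetical figures: tuning is expected to cut the hardware bill
# from 900,000 to 600,000 but ties up a team costing 200,000.
pays_off, net = tuning_pays_off(
    tuning_cost=200_000,
    capacity_cost_untuned=900_000,
    capacity_cost_tuned=600_000,
)
print(f"worth continuing: {pays_off}, net savings: {net}")
```

Once the projected capacity savings from another round of tuning fall below the cost of doing that round, the effort has passed the point of diminishing returns.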
Perennially I stumble into the question of when is the performance tuning effort finished? How do you know if enough has been done or there is still further to go?
Personally, I think this should be dictated by Service Level Agreements (SLAs). If the throughput / response time is within the SLAs then we are done tuning. Any more tuning should only be done if the application (in a new release) starts missing its SLAs or the SLAs themselves have changed.
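Here is a minimal sketch of what "within the SLAs" could look like as a check on test results. It assumes the SLA is expressed as a 95th percentile response time ceiling plus a minimum throughput. That form, the function names and the sample data are all assumptions for illustration, since real SLAs vary.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def meets_sla(response_times_ms, window_s, max_p95_ms, min_throughput_rps):
    """True if both the response time and throughput targets are met."""
    p95 = percentile(response_times_ms, 95)
    throughput = len(response_times_ms) / window_s
    return p95 <= max_p95_ms and throughput >= min_throughput_rps


# Hypothetical sample: 10 requests measured over a 2 second window.
times_ms = [120, 150, 180, 200, 220, 250, 300, 310, 400, 900]
done_tuning = meets_sla(times_ms, window_s=2, max_p95_ms=500, min_throughput_rps=4)
print("done tuning:", done_tuning)
```

In this made-up sample the throughput target is met but one slow request pushes the 95th percentile over the limit, so by the rule above tuning is not finished yet.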
But not everyone has defined or agreed upon SLAs, which muddies the waters. Say, for example, we're building a new application. We have a number of components that we're performance testing but we don't know if the results we are seeing are the best results that can be achieved. Well, that makes sense. If we have no experience working with a particular end point (e.g. database, messaging engine) then obviously our lack of knowledge of that package or of our application code will hamper the analysis.
This is where application monitoring comes in. Properly installed and configured, the application monitoring tool should be able to provide enough information to tell whether we have the best throughput and response time. Look at the metrics and compare CPU and memory utilization. Make sure all the resources under load are as close to maxed out as possible without introducing contention. Look at the throughput / response time graphs as load is ramped up and then at how they behave after reaching steady state. Once all these have come together you are probably ready to go to the next step and oversaturate the environment. Once you've proven that oversaturation has occurred (because of the noted degradation) then you can probably tune no better. And it will probably take a lot of tuning work to get to the point where you are able to saturate all the resources. This will be through a variety of methods like increased thread pools, vertical/horizontal scaling, increased memory settings, garbage collection policy analysis, changing application code, etc.
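The "noted degradation" can be spotted from ramp-up results with very little code. This is a simplified sketch that assumes you have throughput measured at increasing load levels. The function name, the 5% tolerance and the sample figures are arbitrary choices for illustration.

```python
def find_saturation_point(results, tolerance=0.05):
    """Find the load level where throughput peaked before degrading.

    results: list of (concurrent_users, throughput_rps) tuples.
    Returns the load at peak throughput if some higher load level
    fell more than `tolerance` below that peak, otherwise None
    (meaning saturation has not been demonstrated yet).
    """
    peak_load, peak_tput = max(results, key=lambda r: r[1])
    for load, tput in results:
        if load > peak_load and tput < peak_tput * (1 - tolerance):
            return peak_load
    return None


# Hypothetical ramp-up: throughput climbs, flattens, then degrades.
ramp = [(50, 480), (100, 950), (200, 1800), (400, 2100), (800, 1700)]
saturation = find_saturation_point(ramp)
print("saturated at load:", saturation)
```

A result of `None` means the ramp never pushed past the peak, so the test has not yet shown where the environment tops out.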
However, is it worth it? Does an organization really want to pay for optimal tuning when it may not be necessary? These efforts can be long and drawn out (i.e. months instead of weeks). They can also tie up a lot of highly skilled technical people on an effort that provides some unknown possible improvement. Let's look at the only scenario I can think of (can you think of another?).
High volume applications also tend to be resource hogs requiring gobs (i.e. hundreds or thousands) of JVMs and the supporting infrastructure. This can get expensive real quick, especially when all the administration, configuration and maintenance costs are factored in. In this one scenario I can see justification, even beyond defined SLAs, for getting the best performance possible in order to reduce the capacity needed to support the application.
I am often asked the philosophical question
"Is a complete production outage a performance problem?"
From a performance perspective it sure is!
Today's blog post is about a performance aspect most folks in IT miss completely: backup and restore.
"Wait a minute here Mr. Polozoff! We take regular backups!"
"Ah, but have you ever actually taken that backup and restored it to another server?" is my typical response. You can probably guess the answer I get, and it is rarely positive. That is the point of today's post. I was reminded of this by a recent discussion with a colleague who was conducting a review of a client environment. Backup is a regular activity in just about every IT shop, but what is not a regular activity is actually seeing if the backup can be restored to another machine.
I actually saw the devastating effect this had at one of the clients I was supposed to engage with. On my second day I arrived at the office to be met by the manager I was working with. He had bad news. The production database had failed and the backup failed to restore. In fact, none of the backups they had could restore the database. Since I was there to look at some problems in production, that effectively ended the engagement because I couldn't study the application server environment if it had no database to connect to.
That is a hard way to learn an important lesson. No matter how often you take a backup, it needs to be tested regularly by restoring it to another server, to ensure the backup is what we think it is and not just a jumble of bits that ultimately has zero value.
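The idea of a restore test can be sketched at the file level. This is only an analogy: a real database restore would use the database's own backup and restore tooling, followed by application level checks. The helper names and the sample file are hypothetical.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def checksums(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def backup(source_dir, archive_path):
    """Write a compressed archive of everything under source_dir."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(source_dir, arcname=".")


def restore_and_verify(archive_path, expected, restore_dir):
    """Restore into a different location and compare against the original."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(restore_dir)
    return checksums(restore_dir) == expected


with tempfile.TemporaryDirectory() as work:
    source = Path(work, "data")
    source.mkdir()
    (source / "orders.db").write_bytes(b"pretend database contents")
    expected = checksums(source)

    archive = Path(work, "backup.tar.gz")
    backup(source, archive)

    restored = Path(work, "restore_target")  # stands in for "another server"
    restored.mkdir()
    ok = restore_and_verify(archive, expected, restored)
    print("backup restores cleanly:", ok)
```

The important part is the last step: the backup only counts once it has been restored somewhere else and what came back has been compared with what went in.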