Portal administration and performance
WebSphere Portal's largest value proposition is that it aggregates content, and because of this, it also aggregates lines of business, partners, people, technologies, and various middleware stacks that contribute to the overall IT stratosphere. Caring for and feeding a WebSphere Portal deployment is a complicated process. I quite often get called in to speak to customers and at conferences on the subject of portal deployment and operations, and I find myself addressing the same questions over and over.
This blog of mine is an attempt to get some of this information recorded and centralized. Much of it, hopefully, will find its way into product documentation, Redbooks, and white papers eventually. But for now, I will vet it out here.
Thanks for stopping by. Comments are always welcome.
When asked whether WebSphere Portal supports 64bit, the real question is whether WebSphere Portal supports running in a 64bit JVM. We have supported running in a 32bit JVM on 64bit hardware for quite some time. We introduced our first 64bit JVM support on the Series i platform with WP 5.0.2. We added 64bit JVM support on zLinux with WP 184.108.40.206 (to overcome the 31bit address space limitation with small maximum heap sizes) and most recently added 64bit JVM support with HP-UX on HP Integrity Servers with WP 6.0.1. We will be adding more and more platforms, especially with new releases, but I typically ask in return if you are sure you really need 64bit JVM support.
Sure, with 64bit you can have very large heap sizes (many gigabytes instead of the maximum of 2GB on most UNIX systems), and thus allow a single application server instance to become CPU saturated before the heap is consumed, but that isn't necessarily a good thing. The larger the heap can grow, the longer garbage collections can take, especially full GCs, which require a pause of the JVM while the heap is scanned and defragmented, looking for the maximum amount of garbage to collect. The larger the heap, the greater the potential fragmentation, and thus the longer the full GC cycles. And with the sheer number of objects that are created and destroyed every second in a portal, fragmentation in the heap can happen more often than you might think. These pauses can amount to a poor user experience.
Personally, I haven't seen any specific data to suggest what the perfect maximum heap size is for a portal, and I'm certain that number will vary by implementation, but based on conversations I've had with performance specialists, I suspect it is somewhere in the 1.75GB to 2GB range.
There are a lot of knobs and switches in the WebSphere Portal and Application Server software stack that you can play with in the process of tuning the environment. We provide guidance and recommendations of what to pay attention to through our Performance and Tuning Guide, but the fact is that what actual values work best for you depends highly on what type of applications you run in Portal and the size and activity of your user population. The only way I know of to accurately arrive at the correct set of tuning values is to follow this basic testing methodology:
Performance and capacity testing is a long, highly iterative process. It is also resource intensive, as it requires dedicated systems for days on end, as well as enough people to manage the test environment, observe test results, and tune the system. Oftentimes, this important process is the first thing cut from a deployment plan in jeopardy. In my experience, you either pay for this time up front, before you go live, or you pay for it after you go live. It must be done, and it is a lot more expensive the later you do it.
But all this being said, what should your goal be? Obviously, you go into this process with certain metrics in mind. For instance, I want to be able to handle 400 concurrent users with no worse than a 5 second response time, and maybe that is only during login. That's fine, and simple to measure, but there will be days where, for some reason, you have a lot more than 400 concurrent users, or you have to take systems down and the remaining systems must take the load. It isn't enough to know if your environment can handle the typical load; you need to know if it can handle the atypical, or worst case, scenario as well. You may not know what the worst case scenario is up front. But what you do need to know is what the maximum capacity of your portal environment is, so that as usage approaches that number, you will know you are in trouble and need to add capacity.
To understand what your environment's capacity is, you need to drive utilization of your portal environment to the point of CPU exhaustion. Not memory exhaustion, or DB connection exhaustion - CPU exhaustion. The reason for that is that as CPU utilization approaches 100%, the system slows down to the point of nearly being unresponsive, but it doesn't fail. Under this condition, the Web servers managing the load across a portal cluster will mark such a system as down and route traffic elsewhere, giving the server a chance to recover, which it should once the requests have been processed. If you run out of some other resource, like memory or DB connections, before running out of CPU, then things really start to fail and you won't recover from that.
So, as you perform your maximum capacity tests, give the server instances a large enough Java heap and large enough request and DB connection pools to let enough traffic through to drive CPU to 100%. If you run out of some resource other than CPU first, then it is time to scale vertically (creating vertical cluster members on the same physical system) or reallocate processors to other systems. If you can configure the system to meet this goal, then you have your "sweet spot".
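As a rough illustration of why the measured maximum capacity matters, here is a minimal Python sketch of the "can the remaining systems take the load" check. The function names and the per-server figure of 120 users are hypothetical; the per-server maximum is assumed to be the number you measured by driving a single server to CPU saturation in your own tests.

```python
def users_per_server(total_users, servers_up):
    """Average concurrent users each running server must carry."""
    if servers_up <= 0:
        raise ValueError("at least one server must be up")
    return total_users / servers_up


def survives_failure(total_users, servers, max_users_per_server, failures=1):
    """True if the remaining servers stay at or below their measured maximum.

    max_users_per_server is an assumption: the concurrent-user count at which
    one server reached CPU saturation during capacity testing.
    """
    remaining = servers - failures
    if remaining <= 0:
        return False
    return users_per_server(total_users, remaining) <= max_users_per_server


if __name__ == "__main__":
    # Hypothetical numbers: 400 concurrent users, each server saturates at 120.
    for servers in (4, 5):
        ok = survives_failure(400, servers, max_users_per_server=120)
        print(f"{servers} servers, one down: {'OK' if ok else 'over capacity'}")
```

With these made-up numbers, four servers handle the typical load comfortably but cannot absorb the loss of one server, while five can. The point is that you can only run this check once you know the real saturation point.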
I'm often asked how far vertically should Portal be stacked on a single server, and should it be, given the single point of failure a single server provides.
First of all, vertical scaling implies vertical clustering, meaning you host multiple cluster members (instances) within a single WAS configuration profile (node).
Vertical scaling, or clustering, is a really, really good way to take advantage of available server resources. If you deploy hefty servers in production with multiple CPUs and tons of memory, then you are likely running out of heap space before you run out of CPU, which is not good and not the best use of your server's resources (see my previous blog post on finding the performance sweet spot).
Vertical clustering is really easy to set up, especially with WP 6.0. It amounts to just a few clicks of the mouse within the WAS console to create a new cluster instance based on the local node's configuration - no additional tweaking is necessary. You can take vertical scaling only so far, though, before you have to scale out horizontally, adding additional server nodes to your cluster. How far you take your cluster vertically depends again on how many instances it takes to strike the appropriate balance between memory utilization and CPU utilization (refer to previously mentioned blog post). In general, you don't want to run out of CPU (pegged at 100% utilization 100% of the time) while you have plenty of available heap storage left (which means you should probably run fewer vertical cluster members), nor do you want to run out of heap storage before you exhaust the CPU (you have too few vertical cluster members).
The likelihood that you will lose a hardware component is low. With the proper balance between horizontal and vertical scaling, you can get the most out of your hardware's horsepower while also guarding against hardware failures.
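The balance between heap and CPU described above can be turned into a back-of-the-envelope starting point. This Python sketch is only a heuristic of my own, not a product formula: it assumes you have measured the CPU utilization at which a single cluster member exhausts its heap, and the `os_reserve_gb` and `native_overhead` parameters are hypothetical placeholders for memory the JVMs need beyond the Java heap. Load testing remains the only way to confirm the number.

```python
import math


def estimate_vertical_members(cpu_pct_at_heap_limit, ram_gb, heap_gb,
                              os_reserve_gb=2.0, native_overhead=1.3):
    """Rough starting estimate for vertical cluster members on one server.

    cpu_pct_at_heap_limit: total CPU % observed when ONE member runs out of heap.
    ram_gb:                physical memory on the server.
    heap_gb:               maximum Java heap per member.
    os_reserve_gb, native_overhead: assumed allowances (hypothetical defaults).
    """
    if not 0 < cpu_pct_at_heap_limit <= 100:
        raise ValueError("CPU percentage must be in (0, 100]")
    # Members needed before CPU, rather than heap, becomes the limit.
    by_cpu = max(1, math.floor(100 / cpu_pct_at_heap_limit))
    # Members that physically fit in memory.
    by_ram = math.floor((ram_gb - os_reserve_gb) / (heap_gb * native_overhead))
    return max(1, min(by_cpu, by_ram))


if __name__ == "__main__":
    # Hypothetical server: one member fills a 1.75GB heap at 30% CPU, 16GB RAM.
    print(estimate_vertical_members(30, ram_gb=16, heap_gb=1.75))
```

Treat the result as the first value to try in a capacity test, then adjust up or down depending on whether heap or CPU runs out first.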
Virtual portals are ideal for setting up many micro sites that all have the same sets of applications in common. The most common use cases involve central service providers hosting a specific application suite for a series of independent yet similar tenants (departments, sales teams, vendors, etc.) that all need access to that same application but want a differently branded experience. This is the use case for which virtual portal was designed. This allows the maximized use of costly infrastructure and allows more users through the system without impacting the system with too many disparate applications.
That being said, virtual portals can also be used to host totally different user communities with different application needs. However, you do have to be more careful with how having many different applications will impact JVM heap utilization. Applications are shared resources across all virtual portals, regardless of whether the VP is actually using a particular application or not. Therefore, users in VP A can experience the effects of applications running for users in VP B. You can isolate the application environment from the virtual portals using WSRP, running those applications in their own JVM, but otherwise there is no application isolation inherent in virtual portals, nor will there ever be (hence the term "virtual").
If all the tenants use a common set of applications, you can likely scale to a very large number of tenants and virtual portals. If there is very little in common between the virtual portals, then you can expect to scale only so far. How far depends on the relative expense of the applications in terms of memory consumption.
We have documented in the past that we only support 1000 virtual portals. That isn't actually true. We have only tested (back in WP 5.1) up to 1000 virtual portals. There are no physical limits to the number of VPs we can support. The number of virtual portals a system can support is highly dependent on the size of the VP in terms of pages, the number of users having access to that VP, and the inherent overhead of the applications in the system. There is very little to no overhead in the usage of virtual portal by itself.
For a system with a large number of virtual portals, performance is optimized when:
Otherwise, a large amount of resource (memory and CPU) is consumed keeping up with every user's configuration.
Performance will eventually degrade with very large virtual portal deployments with little content commonality between users, as our caching mechanism becomes less efficient and too memory intensive. What "large" means will vary by implementation, but is most likely in the several thousand range. At that point, you should consider adding parallel portal clusters for future VP instances and federate between them using technology like WebSphere Application Server Extended Deployment, which knows how to route users based on virtual portal ID to the correct Portal cluster.
In WP 6.0, we broke up our configuration and content repository databases into "domains", or database instances/schemas that are organized by function. The intention of these domains is to allow them to be located separately from each other, and in some instances, shared between multiple identical portal clusters. The domain-based organization also enabled some functional features, such as loose-coupling between customization data and static release data, but that isn't the subject of this blog entry.
Geographic deployments are core to WebSphere Portal's ability to deploy global portals and meet 24x7 uptime goals. To facilitate this, certain database domains must be shared across deployments, to ensure consistency of data for all end-users.
The database domains are:
The release, feedback, and likeminds domains should be unique per cluster. Since feedback and likeminds are rarely used, you can include their schemas along with the release domain in the same database instance if you like. Even if those features are in use, to keep the deployment simple, I still recommend maintaining them with the release domain. Keeping the release domain specific per cluster is essential for allowing one cluster to be serviced while allowing other identical clusters to remain in production.
The community, customization and wmm domains should be shared across all identical clusters, as they contain end-user information which must be common across these clusters. That means all clusters can either refer to the same DB instance that holds these three domains, or you can employ 2-way replication to keep them synchronized. If you use 2-way replication, the replication interval must be shorter than the time it takes for a user to be rebound to a different cluster. You do not want an end-user rerouted to a different cluster BEFORE their data is replicated over. The trigger depends on your global load balancing logic, but I prefer domain-based routing since it pretty much guarantees that a particular user will go to a particular cluster/datacenter unless there is a failure causing the user to be rerouted to another cluster. DNS-based routing is problematic because hostnames could be re-resolved at any time, even during a user's active session with a particular cluster.
The last domain, jcr, is special, because it can contain both release-oriented and user-oriented data. In general, though, I recommend that the JCR be treated as a release domain, since the vast majority of the data it contains comes from staging or authoring environments. Personalization rules and policy definitions are typically developed internally and staged out. WCM content is authored and can be syndicated out to multiple clusters simultaneously. Database-based 2-way replication is not recommended or supported for the JCR as a means of reducing the amount of syndication required, as it can cause problems with content visibility across all clusters. WCM uses a caching mechanism that relies on syndication as a cue to invalidate cache entries. Without it, users may not see updated content without the servers being restarted.
When there is a suspected memory leak in a Portal application, or in the Portal itself, I typically follow this process to collect the heapdumps necessary to properly debug the problem:
Now that you have three heapdumps spanning a relatively long period of time, you need to analyze the heapdumps to look for possible leak suspects. I use HeapAnalyzer from alphaWorks (http://www.alphaworks.ibm.com/tech/heapanalyzer). Be warned, though, that for heapdumps taken from heaps of size 1.5GB or more, you will need a LOT of memory on the system where you run HeapAnalyzer to analyze the heapdump. I would recommend running it on a 64-bit system where you can configure the tool itself with a massive heapsize (7GB or more). It will take a long time to analyze it too.
Once analyzed, the tool can be used to point out suspected memory leaks. By having the system quiesce (no active requests or sessions), there should be a large disparity between the leak suspects and other allocated "noise" in the heap, especially as you analyze the two older heapdumps.
In terms of detecting whether you actually have a memory leak situation versus simply running out of memory because of running too many requests through a single portal, look at the verboseGC output. The JVM heap will fill up over time, and sometimes quickly depending on traffic patterns, but once it reaches 90% capacity or so, the JVM should perform a full GC with compaction, to defragment the heap and claim as much memory as possible. I call the point it returns to the "low water mark". If over time, during a load test with a constant number of users, you see this low water mark creep upwards, then you may have a memory leak. Ideally, it should return to about the same point each time.
I have used AlphaWorks' PMAT tool (http://www.alphaworks.ibm.com/tech/pmat) to graphically detail the GC cycles. It is very simple to visually see the pattern and determine if you see the low water mark creeping upwards over time.
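If you would rather check the low water mark numerically, the idea can be sketched in a few lines of Python. This is a minimal illustration, not what PMAT does internally. It assumes you have already extracted the heap occupancy (in MB) left after each full GC from the verboseGC output, since that format varies by JVM; the function names and the 1 MB-per-GC tolerance are hypothetical.

```python
def low_water_slope(samples):
    """Least-squares slope of heap-after-full-GC samples, in MB per full GC."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two full-GC samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def looks_like_leak(samples, tolerance_mb_per_gc=1.0):
    """True if the low water mark creeps upward faster than the tolerance.

    Only meaningful for samples taken under a constant user load.
    """
    return low_water_slope(samples) > tolerance_mb_per_gc


if __name__ == "__main__":
    stable = [610, 605, 612, 608, 611, 607]    # returns to about the same point
    creeping = [600, 640, 690, 730, 780, 820]  # low water mark keeps rising
    print("stable:", looks_like_leak(stable))
    print("creeping:", looks_like_leak(creeping))
```

A rising slope is only a hint that a leak may exist. The heapdump analysis described above is still what identifies the actual suspect.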
Setting up new instances of WebSphere Portal can be a time consuming task, especially if you have to repeat that task many, many times to build out an entire infrastructure. Fortunately, the process can be dramatically sped up through the use of cloning or virtualization techniques.
Cloning, in this context, refers to taking a fully installed, configured, and customized portal and using it as a basis for many other instances. So, in this case, you incur the overhead of building out the very first instance, then reuse that as a template for new instances. At the moment, this process of copying portal instances only works for standalone portals, so they must still be separately federated and clustered. The process is described in my paper on Cloning a WebSphere Portal V6 installation. This paper will be updated soon to include WP 6.1.
Probably one of the first things you will note in the referenced paper above is that the cloning process requires the use of WebSphere Application Server Install Factory to mass-replicate WAS installations. The portal portion of the cloning process comes from simply ZIPping up the PortalServer directory. I certainly wish WAS would also support ZIPping up the AppServer directory and its profile, but they do not at the moment, or else my paper could be condensed down to probably 2 pages. We are working towards this, but for the time being, the cloning process is orchestrated through Install Factory.
The other way to speed up deployments is through the use of virtualized OS images. Most people think of VMware when they think of virtualization. VMware, and other OS-level virtualization techniques, bring not only ease of instantiation, but a much more flexible and efficient architecture. For instance, imagine having several VMware images representing different test permutations that can be recalled at will, or being able to literally replace dated hardware with newer, faster servers and redeploy a current VMware instance to it without having to reinstall Portal. There is also, of course, the use of third party provisioning tools to balance virtual images across a farm of hardware, to ensure consistent and optimized resource utilization and power consumption, as part of a whole Green Lab strategy.
Portal has a limited support statement for VMware, mainly because we don't have enough experience deploying it ourselves, and can't guide our customers on how to use VMware effectively to meet their performance and capacity targets. Rest assured, though, that we are actively working to get this experience. At the time of this post's writing, we are testing various combinations of WP 6.0.1.x and WP 6.1 and will continue to leverage VMware and similar technologies in our test infrastructures.
IBM SWG announced today a new "platform as a service" delivery channel for SWG products through Amazon Web Services (AWS):
IBM is OEMing our software to Amazon, and Amazon is charging their users for access to it.
Users can now sign up for an account with Amazon Web Services (AWS) then launch virtual images fully configured with our software in their Elastic Compute Cloud (EC2). Users simply pay by the hour for usage of the image, then throw it away when they are done.
Initially, the offering is for free-for-use development images, called Amazon Machine Images, or AMIs, since the largest community of users of EC2 is developers. Later this year we will introduce AMIs for production use which can be used, for instance, by ISVs/BPs to host their applications on.
Besides Lotus (Portal/WCM), IM (Informix and DB2) and AIM (WebSphere sMash) are also participating.
I see this as a huge opportunity to lower cost of ownership of the Portal platform. I can think of several ways to leverage portal in a cloud:
The HTTP Server plugin that ships with WAS is quite a nice little piece of router magic that ships for free, yet workload management through the HTTP Server seems like such black magic to most people. There are just a few things you should consider when tuning the plugin that might make your life a bit easier:
1) The ConnectTimeout parameter on each <Server> element only governs the HTTP Server's ability to open a socket to the WebSphere Portal HTTP endpoint. This is possible even if the entire Web Container thread pool is consumed. So, even if Portal is hung up and all threads are exhausted, the plugin can still connect to Portal's HTTP endpoint, and the plugin will continue to hammer Portal with more traffic until it is reduced to a bloody pulp. The ConnectTimeout parameter is only good for when the Portal's Java process crashes and thus cannot respond to new socket requests.
The best way to defend against a hung Portal is to set the ServerIOTimeout attribute (missing by default, which is why most people miss it). This governs how long the plugin will wait after establishing a connection for data to be returned. Set this value to your pain threshold for waiting for a page to return. After this timeout, that server will be marked down.
2) Performance tests show that "Random" is a more efficient workload management policy than "Round Robin". There is evidence that shows when using Round Robin and a server gets marked down, a larger than normal allotment of that server's traffic is shifted to the very next server in line instead of being evenly redistributed across the cluster.
3) Consider using the <PrimaryServers> and <BackupServers> elements for active/passive configurations, where parts of a cluster can be targeted for normal traffic (PrimaryServer) and other servers in the cluster only get traffic if all of the primaries are marked down (BackupServer).
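To make the Round Robin observation in point 2 concrete, here is a toy Python model. It is not the plugin's actual algorithm, only a naive round robin that hands a downed server's turn to the next server in line, compared with a uniform random pick among the servers that are up. The server names, request count, and function names are all hypothetical.

```python
import random
from collections import Counter


def naive_round_robin(servers, down, requests):
    """Toy round robin: a downed server's turn falls to the next server in line."""
    if all(s in down for s in servers):
        raise ValueError("no servers are up")
    counts = Counter()
    for i in range(requests):
        idx = i % len(servers)
        while servers[idx] in down:
            idx = (idx + 1) % len(servers)
        counts[servers[idx]] += 1
    return counts


def random_policy(servers, down, requests, seed=0):
    """Toy random policy: pick uniformly among the servers that are up."""
    up = [s for s in servers if s not in down]
    if not up:
        raise ValueError("no servers are up")
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    counts = Counter()
    for _ in range(requests):
        counts[rng.choice(up)] += 1
    return counts


if __name__ == "__main__":
    servers = ["A", "B", "C", "D"]
    print("round robin:", dict(naive_round_robin(servers, {"B"}, 3000)))
    print("random:     ", dict(random_policy(servers, {"B"}, 3000)))
```

In this model, with B marked down, server C receives twice the traffic of A and D under round robin, while the random policy spreads B's share roughly evenly across the survivors.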
The developerWorks article for describing and supporting cloning of WP 6.1 installations is now available: http://www.ibm.com/developerworks/websphere/library/techarticles/0902_lamb/0902_lamb.html
I'm asked quite often what the best approach is for integrating applications into an existing portal. Should they be rewritten as portlets and run in the Portal itself? What about application isolation, because I don't want a bad portlet taking down my portal? What about the latency effects of running some applications remotely? The answer for what options to use, in typical developer fashion, is "it depends."
First, I think it makes sense to briefly outline your options. You can find more detail on all of these through our product's Info Center as well as from the wide variety of white papers and wiki posts available from the Portal Zone (http://www-106.ibm.com/developerworks/websphere/zones/portal/).
In my next blog post, I'll walk through the decision making process to help you decide which one of the above options would work best for you.
I'm afraid the following decision tree slightly over-simplifies the process of determining what application integration technique should be chosen for WebSphere Portal. It is not meant to be the "gospel" by which all integration techniques are decided. Instead, it should serve as a guide for the types of information that need to be factored into the decision making process. (This blog post is a continuation of the previous post, where the integration techniques are described in detail.)
The questions below are posited in order. If I believe the answer to a question leads to a certain integration technique, I will say so. Otherwise, I'll direct you to the next question.
After reading through the questions, it may seem odd at first that I lead with NOT starting with portlets, but I found it easier to eliminate the lower fidelity, less elegant solutions first than to try to arrive at the decision to run everything as a portlet first. There are just too many advantages to using portlets to enumerate as a series of discrete questions, whereas there are very few questions that can help you eliminate portlets outright. Hopefully this will make sense to you as you work your way through the list below. Let's get started:
WebSphere Portal and Lotus Web Content Management are on the new IBM Smart Business Cloud for Development and Test. This new cloud offering from IBM is an Infrastructure as a Service (IaaS) offering similar to Amazon's Elastic Compute Cloud, but with several usability enhancements, including a more usable Console and the ability to modify aspects of the image (such as activation scripts) without needing to rebuild the image!
Check it out: http://www-935.ibm.com/services/us/igs/cloud-development/
A little known fact of Portal (and WAS, for that matter) is that when Portal serves up theme files, there are no default cache-control headers in the HTTP response. As a result, these files are not saved in the browser's local cache. Therefore, theme components that are scripts (.js), CSS files and images (served up as HREFs in "secondary GET requests" after the initial portal page GET request is parsed by the browser) are returned to the browser with no directive to tell the browser to cache these for future renders. This means that all of these components are fetched on each and every page render.
Use a tool like "fiddler" or "firebug" to observe this behavior.
This problem could induce long render times for your portal. Users with poor bandwidth, no local edge caching and/or long latency back to the main Portal servers could see dramatically longer render times than users who are local to the portal servers with excellent broadband service.
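Once you have captured the response headers for a theme resource with one of those tools, the check itself is simple. The following Python sketch is a deliberately simplified test for whether a response carries an explicit caching directive. It is not a full implementation of HTTP caching rules (for example, it does not parse the Expires date), and the function name is hypothetical.

```python
def browser_may_cache(headers):
    """True if the response headers give the browser an explicit freshness hint.

    Simplified: a positive Cache-Control max-age, or any Expires value other
    than 0 or -1, counts as cacheable; no-store or no-cache counts as not.
    """
    h = {k.lower(): v for k, v in headers.items()}
    directives = [d.strip().lower()
                  for d in h.get("cache-control", "").split(",") if d.strip()]
    if "no-store" in directives or "no-cache" in directives:
        return False
    for d in directives:
        if d.startswith("max-age="):
            try:
                return int(d.split("=", 1)[1]) > 0
            except ValueError:
                return False
    expires = h.get("expires", "").strip()
    return expires not in ("", "0", "-1")


if __name__ == "__main__":
    # Header sets as they might be copied from a browser debugging tool.
    print(browser_may_cache({"Content-Type": "text/css"}))
    print(browser_may_cache({"Cache-Control": "public, max-age=18002"}))
```

A theme file served with only a Content-Type header fails this check, which is the situation described above.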
The solution is to use IHS (Apache) and force these components to have cache-control headers via the httpd.conf IHS configuration file. Then, these theme elements are cached in the browser's cache for a period of time. Here is a portion of an IHS httpd.conf to use as an example to force cache-control headers for static theme elements:
# ExpiresByType only takes effect when mod_expires is loaded and enabled
ExpiresActive On
ExpiresByType text/css "access plus 20 days"
ExpiresByType image/jpeg "access plus 20 days"
ExpiresByType image/jpg "access plus 20 days"
ExpiresByType image/gif "access plus 20 days"
ExpiresByType image/png "access plus 20 days"
ExpiresByType image/bmp "access plus 20 days"
ExpiresByType image/x-icon "access plus 20 days"
# The following will modify the Cache-Control header on the WAS content to have the Public attribute and to become stale in intermediate caches after 36001 seconds (10 hours)
# Further, it will force IE to check the server after 18002 seconds (5 hours) to see if an update is available.
# Also, note that if you are using HTTPS (SSL), only "public" content (content with the cache-control header public) is cacheable on the browser.
As an aside, note that for some reason, WAS currently does not set the mime type correctly for .png files. To resolve this, add ".png" to the virtual host via the WAS Admin Console.
Note: Apache "Expires" will not overwrite an already existing Expires header. So, if you write a custom WebSphere "FileServingServlet", for example, that places an "Expires: 0" header on all the content, regardless of type, then Apache can't really help you. You should take care in WebSphere code to place an appropriate cache-control header on your response to ensure that content stays in the browser cache as long as appropriate. Setting a cache-control header that allows content to be served from the browser cache will greatly improve overall system performance on highly utilized sites.