A classic use case for Portal is integrating backend data - be it via classic portlets, Scripting portlets, AJAX proxy calls or other means on the Portal side, calling the backend systems via EJBs, SOAP web services, REST or other technologies. A common issue is that those backends slow down or hang at times, and when they do, they slow down or even hang the Portal as well. Even backends that are only rarely called can eventually cause hang situations, since the waiting threads accumulate until no web container threads are left to handle work on the Portal side. In addition, the HTTP server plug-in will typically mark the JVM down and transfer its users to the remaining JVMs, causing additional stress because logins now have to happen on the JVMs that receive the existing users (during those logins the customization, community and release databases are typically hit, so the root cause can easily be misdiagnosed as a problem with database access).
The following solutions (among others) can be used to address the problem:
1. Consistent low timeouts for the backend calls
When calling the backends, ensure consistent, low timeouts for all of the calls - ideally lower than 5 seconds. Nearly all APIs provide a way to define a timeout per call, or a global timeout can be set in WebSphere; the AJAX proxy allows defining a timeout as well. The timeout for the backend calls should ideally be lower than the timeout of the HTTP server plug-in that monitors the request time, as otherwise the request will be re-routed to another JVM where it will hang another thread.
A challenge can be that a portlet sometimes queries multiple backend services sequentially, so the individual timeouts add up. Another challenge can be that certain calls legitimately take longer due to known backend deficiencies while the majority of calls are fast - in that case a low timeout might not be feasible.
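As a minimal sketch of what such a call with explicit timeouts could look like, the following uses the JDK's HttpURLConnection; the backend URL, method name and the 3-second values are purely illustrative and would come from configuration in a real portlet:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public String fetchAccountData(String accountId) throws Exception {
    // Hypothetical backend URL; in a real portlet this would come from configuration.
    URL url = new URL("https://backend.example.com/accounts/" + accountId);
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    // Keep both timeouts low and consistent so a hanging backend cannot
    // occupy a web container thread for more than a few seconds.
    connection.setConnectTimeout(3000); // max. 3 seconds to establish the connection
    connection.setReadTimeout(3000);    // max. 3 seconds waiting for response data
    BufferedReader reader = new BufferedReader(
            new InputStreamReader(connection.getInputStream(), "UTF-8"));
    try {
        StringBuilder body = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            body.append(line);
        }
        return body.toString();
    } finally {
        reader.close();
        connection.disconnect();
    }
}
```

A read or connect timeout surfaces as a SocketTimeoutException, which the portlet can catch to render a fallback message instead of blocking the request thread.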
2. Portlet Load Monitoring
Portlet load monitoring allows administrators to protect their portal by limiting the number of concurrent requests and the average response time allowed for JSR 168 or JSR 286 portlets. If a portlet exceeds either the defined maximum number of concurrent requests, or the average response time, then Portlet Load Monitoring no longer allows further requests by the portlet. Instead, the portal renders the portlet as unavailable, and the portlet code is no longer called for further requests. This way, the portal installation is protected from non-responsive portlets consuming an increasing number of threads.
For more details see: http://www.ibm.com/support/knowledgecenter/SSHRKX_8.0.0/dev/plmc.html
A challenge with this solution can be to define how many concurrent requests a portlet may legitimately receive at a certain point in time, or that a short slowdown marks the portlet down for a longer period of time.
3. Circuit Breaker pattern
The Circuit Breaker pattern is a general development pattern that breaks the circuit to a method execution (here, the slow backend call) once a certain number of calls has failed, for example by taking longer than a defined threshold.
It is described in the literature as well as in many blogs, e.g. by Martin Fowler: http://martinfowler.com/bliki/CircuitBreaker.html
Various implementations exist - many of them open source and easily usable.
One example is Hystrix from Netflix.
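As an illustrative sketch (not a definitive recipe), a backend call could be wrapped in a HystrixCommand roughly as follows; the command group name, timeout and threshold values are assumptions, and the property setter names correspond to recent Hystrix 1.x releases:

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandProperties;

public class BackendCallCommand extends HystrixCommand<String> {

    private final String accountId;

    public BackendCallCommand(String accountId) {
        super(Setter
                .withGroupKey(HystrixCommandGroupKey.Factory.asKey("AccountBackend"))
                .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                        // Fail the call if the backend does not answer within 2 seconds.
                        .withExecutionTimeoutInMilliseconds(2000)
                        // Open the circuit when 50% of recent calls fail ...
                        .withCircuitBreakerErrorThresholdPercentage(50)
                        // ... and keep it open for 5 seconds before a trial call.
                        .withCircuitBreakerSleepWindowInMilliseconds(5000)));
        this.accountId = accountId;
    }

    @Override
    protected String run() throws Exception {
        // The actual (potentially slow) backend call, e.g. via REST or EJB.
        return callBackend(accountId);
    }

    @Override
    protected String getFallback() {
        // Returned when the call fails, times out, or the circuit is open,
        // so the portlet can still render something instead of hanging.
        return "Account data currently unavailable";
    }

    private String callBackend(String accountId) throws Exception {
        // ... backend access code ...
        return "...";
    }
}
```

The portlet would then invoke the backend via something like `new BackendCallCommand(accountId).execute()`, receiving either the real result or the fallback without ever blocking longer than the configured timeout.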
Typically the implementations allow configurable values for when a circuit should be reset, or health checks that decide whether the call should be made at all.
While this option will likely require some re-coding, it is very powerful and can save the Portal (or other systems) from failing due to backend slowdowns.
An important aspect of solving the issue is testing the error scenarios and having proper monitoring in place. Artificially slowing down the backends, or stopping them, while regular load is applied to the Portal shows both whether the monitoring tools in use report the issue correctly and whether the chosen solution keeps the Portal responding in a defined way during the backend slowdown.
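As one simple way of artificially slowing down a backend during such a test, a stub service could be deployed in the test environment. The servlet below is a hypothetical example where the delay is passed as a request parameter:

```java
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Stub backend for failure testing: answers only after a configurable delay,
// simulating a slow or hanging backend service.
@WebServlet("/slow-backend")
public class SlowBackendServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        // Delay in milliseconds, e.g. /slow-backend?delay=30000 for a 30 second hang.
        String delayParam = request.getParameter("delay");
        long delay = delayParam != null ? Long.parseLong(delayParam) : 10000L;
        try {
            Thread.sleep(delay);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        response.setContentType("text/plain");
        response.getWriter().write("response after " + delay + " ms");
    }
}
```

Pointing the Portal at such a stub while load is applied makes it easy to verify that the timeouts, Portlet Load Monitoring thresholds or circuit breaker settings actually behave as intended.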
To make this less of a one-off, manually defined activity, solutions exist that randomly shut down services or slow them down in order to verify the effects.