As the virtualization of hardware and software resources becomes more pervasive, resiliency of the middleware infrastructure becomes even more critical. The least resilient services can negatively impact other services hosted on the same virtualized hardware and software—for example, by locking and holding shared resources while invoking synchronous service calls over unreliable networks that might suffer high latency.
This article identifies issues that can impact the resiliency of an SOA infrastructure and uses WebSphere Application Server for z/OS to quantify their effects. Several things make this platform well suited to the task: the advanced monitoring of virtualized hardware and software provided by IBM System z™, the multiprocess server architecture of WebSphere Application Server for z/OS, the inherent resiliency mechanisms of both, and the advanced concepts learned from studying large WebSphere Application Server for z/OS customer deployments. Together, they let you identify issues that can affect the broader SOA world and quantify the effects that the proposed solutions will have.
Application servers, virtualized infrastructures, resiliency, and SOA
SOA guidelines promote the development of business services. Business services are essentially reusable business functions that can be shared across an enterprise. The granularity and distribution of these business services emerge as two important factors when designing a new SOA.
Pre-established best practices
Established best practices and lessons learned from related technologies provide strong guidance for building a resilient SOA. For example, in Java™ 2 Platform, Enterprise Edition (J2EE) technology, collocating dependent Enterprise JavaBeans (EJB) components on the same application server allows for optimizations such as pass by reference instead of pass by value. It also reduces the consumption of server resources: one application server worker thread is used in one server, compared to N threads across N servers. By analogy, collocating tightly coupled services in an SOA should yield benefits similar to those observed in J2EE.
In another example stemming from Common Object Request Broker Architecture (CORBA), serialization and deserialization of data has a dramatic effect on performance. Thus, to build high-performing distributed systems, the designer must be cognizant of the remote objects transferred and streamline this to improve performance. These lessons, and many others, are important when building resilient SOA infrastructures.
Application servers and shared-runtime infrastructures
Application servers have emerged as the de facto platform for connecting service providers with service consumers. Understanding how application servers behave is essential when building a resilient and robust SOA infrastructure. The resiliency of an SOA is especially important when the application servers are running within a shared-resource environment, because the negative effects of a brittle SOA infrastructure are amplified. With shared-runtime infrastructures, service isolation of the hardware-silo paradigm is eliminated at the lowest levels, and services compete for the same hardware resources (such as CPU and memory). The result? A brittle SOA can directly or indirectly affect peer services and subsystems, as shared resources are locked, consumed, or exhausted.
Shared runtime infrastructures host many services and subsystems, including databases (IBM DB2®, Oracle, and others), legacy transaction processing (such as IBM CICS®), platform messaging (such as IBM WebSphere MQ), and so on. All of these services and systems compete at some level for the shared resources. The shared runtime uses service policies, relative priorities, and other such metadata to manage the distribution of resources under load. When higher-priority services aren't meeting their defined service-level agreements, the underlying workload manager (the subsystem that manages the allocation of shared resources) decides which services are given access to a resource and which are not. Therefore, higher-priority services can negatively impact lower-priority services when the shared system must make allocation decisions.
Shared-resource contention becomes an important design consideration when adopting an SOA. Bad architectural designs can introduce fragility not only to the SOA, but to the entire shared system. For example, Figure 1 depicts a system where a database is used by two application servers that host two separate, independent services. A failure in server 1 could cause all in-flight transactions to be rolled back. The rollback is executed by the database, which is running at a higher priority. So as the database is rolling back the transactions, shared resources could be shifted away from server 2, which had nothing to do with the failure.
Figure 1. Shared infrastructure
Effects of blocking application server threads
Application servers play an integral role within the domain of an SOA. As a platform, they let application developers focus on the development of the business logic while abstracting the underlying hardware and software nuances. They provide services such as security, transactional integrity, high availability, and so on. As a platform upon which business services are deployed, application servers could evolve to become the cornerstone of SOA. Therefore, it's essential to understand some details that could dramatically affect the application server's stability.
Application servers are composed of two fundamental threading components: managed threads and unmanaged threads.
Managed threads are somewhat heavyweight and are associated with metadata, such as the transaction context and security context. These threads are the entities that execute the transactional work of a business application. Workload managers monitor managed threads to determine whether more resources are required to meet the defined service-level policies. Upon failure, managed threads roll back the transaction in execution. They're normally pooled within the server process for efficient use. Managed threads within the application server directly influence and dictate the server's capacity for processing work (Figure 2 illustrates this scenario).
Figure 2. WebSphere Application Server z/OS threading overview
Unmanaged threads lack the metadata associated with their counterparts, managed threads. They are traditionally lightweight and used for executing tasks such as cache cleanup and object cleanup. Their failure doesn't directly impact the completion of a transaction, and security requirements are minimal.
Managed threads within an application server execute transactional workload and, therefore, directly impact the server's overall capacity and throughput. A single worker thread doesn't execute two transactions in parallel, so the longer a single transaction takes to complete, the longer a worker thread is consumed with processing that single transaction. No new work can be processed until a worker thread has completed its transaction and is free to begin executing the next one (Figure 3 depicts this scenario). The throughput of an application server can be described as the rate at which a single transaction is processed, multiplied by the number of managed worker threads within the server.
Figure 3. Example topology with errors
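As a rough illustration of the throughput relationship described above, the following Python sketch computes an upper bound on throughput from the number of worker threads and the time one thread spends on one transaction. The function name and the sample numbers are assumptions made for illustration; they are not taken from WebSphere Application Server.

```python
def max_throughput(worker_threads: int, seconds_per_transaction: float) -> float:
    """Upper bound on transactions per second for a pool of worker threads.

    Each thread completes 1 / seconds_per_transaction transactions per second,
    and a thread never runs two transactions in parallel.
    """
    if worker_threads <= 0 or seconds_per_transaction <= 0:
        raise ValueError("both arguments must be positive")
    per_thread_rate = 1.0 / seconds_per_transaction
    return worker_threads * per_thread_rate


# Hypothetical server with 40 managed worker threads.
healthy = max_throughput(40, 0.25)   # 0.25 s per transaction
degraded = max_throughput(40, 2.0)   # same work, but each call now takes 2 s
print(healthy, degraded)
```

The thread count is unchanged in both cases; only the time each thread is held per transaction differs, which is why longer transactions translate directly into lower capacity.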
When a worker thread synchronously invokes a service (such as a remote Web service or an EJB service) or retrieves data from a database, it is blocked and must wait for a response before it can continue to process the in-flight transaction. Figure 4 illustrates a scenario in which several layers of services are synchronously connected. In Figure 4, the database service is a "backbone" that is experiencing some sort of latency, such as lock contention in the database or network issues. Because the workflow is synchronous, server 4 can't continue until it receives a response from the database, server 3 can't continue until it receives a response from server 4, and so on up the chain. The longer the delay persists, the more likely it is that servers 1, 2, and 3 become unavailable as more and more of their worker threads become blocked. Essentially, any delay in processing within this synchronous chain impacts the dependent services and, consequently, the application servers hosting those dependent services.
Figure 4. Synchronous services and their blocked worker threads
If a worker thread must be blocked, you need to ensure that the thread is blocked for only a reasonable amount of time. As previously stated, a single worker thread can't process multiple transactions in parallel. So a blocked worker thread, because it's waiting on a response from some remote service, can't proceed with the current transaction and can't start processing a new transaction either. With many blocked worker threads, new work can't be executed until the in-flight transactions are completed, which means this new work must queue and wait for resources to become available.
Products like WebSphere Application Server for z/OS have technology that can try to alleviate the impacts of blocked threads by providing some queuing mechanisms. These queues are used to temporarily hold requests when worker threads aren't available. As the worker threads complete their transactions, the next work item in the queue is executed.
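The effect of such a queue can be sketched with a small, deterministic model. The Python below is a simplification written for this discussion, not a description of how WebSphere Application Server for z/OS queues work: it assumes that all requests arrive at time 0 and are dispatched first-in, first-out to a fixed pool of worker threads.

```python
import heapq


def completion_times(worker_threads, service_times):
    """Return when each request finishes, given FIFO dispatch to a fixed pool.

    All requests are assumed to arrive at time 0; service_times[i] is how long
    request i holds a worker thread (including any time spent blocked).
    """
    if worker_threads <= 0:
        raise ValueError("worker_threads must be positive")
    free_at = [0.0] * worker_threads      # time at which each thread is next free
    heapq.heapify(free_at)
    finished = []
    for service in service_times:
        start = heapq.heappop(free_at)    # earliest available thread takes the request
        end = start + service
        finished.append(end)
        heapq.heappush(free_at, end)
    return finished


# Two threads; the first two requests block for 10 s, the rest need 1 s each.
print(completion_times(2, [10, 10, 1, 1]))
# Same pool and same trailing requests, but nothing blocks.
print(completion_times(2, [1, 1, 1, 1]))
```

In the first case the two short requests cannot start until a blocked thread is released, even though they need only one second of work each.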
Blocked application server worker threads can negatively impact the server's capacity and throughput. Because a single managed thread executes only a single transaction at a time, blocking threads prevents the current transaction from completing and keeps the worker thread from executing the next unit of work.
Figure 5 shows a customer architecture where remote services are invoked from the application server over an unmanaged network. Each service invocation to the banking device blocks a worker thread within the server. Timeouts can serve as a constraint and control the amount of time the worker thread is blocked. For example, if all Remote Method Invocation (RMI) outbound invocations must be completed within 120 seconds, then each service invocation to the banking device takes at most 120 seconds.
Figure 5. Example of customer topology
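A minimal sketch of bounding the time a caller waits on an outbound call is shown below, using only the Python standard library. The names `call_with_timeout` and `slow_device` are hypothetical, and the short sleep stands in for a slow banking device. Note the limitation stated in the comments: this pattern releases the caller, but the helper thread that made the call stays busy until the call returns.

```python
import concurrent.futures
import time

# Separate pool for outbound calls so the caller can stop waiting on them.
_outbound = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(remote_call, timeout_seconds, fallback=None):
    """Wait at most timeout_seconds for remote_call(); otherwise return fallback.

    The helper thread running remote_call is not interrupted; only the caller
    is released. A real client should also set a timeout on the connection.
    """
    future = _outbound.submit(remote_call)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        future.cancel()  # has no effect if the call has already started
        return fallback


def slow_device():
    time.sleep(0.3)  # stands in for a device behind a failing router
    return "balance=100"


print(call_with_timeout(slow_device, 0.05, fallback="unavailable"))
print(call_with_timeout(lambda: "balance=100", 1.0))
```

Because the helper thread is still occupied, a timeout applied only at this level moves the blocking rather than removing it; the timeout also has to be enforced on the underlying connection.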
Suppose that some network issue emerges: a router close to the devices fails, or a cluster of devices fails. Figure 6 illustrates the potential impact of such failures within the application server. Many worker threads can suddenly block and wait for some timeout to expire; in the meantime, however, new work arrives at the server. If the service rate of the server is less than the arrival rate of incoming work, requests queue up. Advanced application server products, such as WebSphere Application Server for z/OS, take some actions if the work queue grows too large. For example, a secondary work timeout can be enforced and new requests can be denied service.
Figure 6. Network issues and their effects on the customer topology
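The relationship between arrival rate, service rate, and backlog can be approximated with a simple fluid model, sketched below in Python. The thread count and arrival rate are hypothetical values chosen for illustration; the 120-second timeout is the one used in the RMI example earlier.

```python
def queue_length(arrival_rate, service_rate, seconds, initial=0.0):
    """Approximate backlog after `seconds`, using a simple fluid model.

    Rates are in requests per second. The backlog changes by
    (arrival_rate - service_rate) each second and never drops below zero.
    """
    growth = (arrival_rate - service_rate) * seconds
    return max(0.0, initial + growth)


# Hypothetical outage: 20 threads, and every call waits for the 120 s timeout.
threads, timeout = 20, 120.0
service_rate = threads / timeout          # about 0.17 requests per second
print(queue_length(arrival_rate=5.0, service_rate=service_rate, seconds=60))
```

Under these assumptions the backlog grows by almost five requests every second, which is the kind of growth that a work-queue limit or secondary timeout is meant to contain.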
One corrective action is an EC3 abend, in which part of the application server (the WebSphere Application Server for z/OS servant region in this case) is judged to be hung and is therefore restarted. Restarting the WebSphere Application Server servant region causes all in-flight transactions to be rolled back, resulting in extra load on the database. As Figure 1 shows, a transaction rollback can affect other work running within the shared runtime environment. Stacked, synchronous services like those depicted in Figure 4 can also be affected: a service down the chain can perform slowly, thereby reducing the throughput of its callers. In more complex environments, a single service can have cascading effects and can cause unrelated services and subsystems to crash, a scenario depicted in Figure 7.
One particular customer experienced the following issues, resulting in a nightmare scenario:
- A network router had issues and caused many worker threads within the WebSphere Application Server servant region to block.
- Work requests queued and waited for a worker thread to become available.
- Workload management determined that the WebSphere Application Server servant region was hung and issued an EC3 abend to terminate the servant process.
- The EC3 abend caused all in-flight transactions to be rolled back.
- The slow service rate of the server coupled with the effects of the transaction rollbacks caused other WebSphere Application Server servant regions to be rolled back.
- The effects repeated for many other services and subsystems, and caused an outage.
Figure 7. Cascading effects of blocked threads
Blocking worker threads and poor service deployment and distribution can significantly degrade the performance of an overall middleware infrastructure. This is reinforced by best practices that have emerged from the J2EE and CORBA domains. These best practices include collocating interconnected application modules and minimizing the number of remote calls within some distributed domain. Essentially, services that are collocated within the same process (examples might include a Java Virtual Machine [JVM] or an address space) can exploit optimizations, such as passing values by reference, executing on the current thread of execution, and avoiding the traversal of the network stack. Accessing services remotely can cause at least the following overhead.
The service consumer must:
- Serialize the request parameters.
- Traverse the network stack for the outbound invocation.
- Encrypt the request if the security policies dictate it.
- Block the worker thread and wait for the response.
- Traverse the network stack to process the return value.
- Decrypt the response if the security policies dictate it.
- Deserialize the return value.
- Resume the blocked worker thread.
The service provider must:
- Traverse the network stack to accept the inbound request.
- Decrypt the request if the security policies dictate it.
- Deserialize the request parameters.
- Dispatch the request to a new worker thread.
- Serialize the return values.
- Encrypt the response if the security policies dictate it.
- Traverse the network stack to send the return value.
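The difference between a collocated call and the marshalling steps listed above can be sketched in a few lines of Python. This is only an imitation under stated assumptions: `pickle` stands in for whatever wire format a real protocol such as RMI/IIOP would use, no network or encryption is involved, and the function names are hypothetical.

```python
import pickle


def local_call(service, argument):
    """Collocated call: the callee receives the caller's object itself."""
    return service(argument)


def remote_style_call(service, argument):
    """Imitates the marshalling steps of a remote call, without any network."""
    request_bytes = pickle.dumps(argument)        # consumer serializes the request
    received = pickle.loads(request_bytes)        # provider deserializes it
    result = service(received)
    response_bytes = pickle.dumps(result)         # provider serializes the return value
    return pickle.loads(response_bytes)           # consumer deserializes it


def echo(order):
    return order


order = {"id": 42, "items": ["a", "b"]}
print(local_call(echo, order) is order)          # True: same object, no copy
print(remote_style_call(echo, order) is order)   # False: a reconstructed copy
```

Even in this stripped-down form, the remote-style path performs four conversions and builds two copies of the data that the collocated path avoids entirely.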
Figure 8 presents a scenario based on an actual customer deployment. For this customer, several components interacted heavily within the scope of a single global transaction. In this scenario, the system was not only susceptible to high numbers of blocked worker threads at any one time, but it also exhibited poor performance and generated high millions-of-instructions-per-second (MIPS) costs.
Let's break down Figure 8 so you understand how to read it:
- Component 1 invokes component 2 using Remote Method Invocation over Internet Inter-ORB Protocol (RMI/IIOP). Component 1 blocks and waits for component 2 to respond.
- Component 2 then invokes component 3 over RMI/IIOP. Component 2 waits for component 3 to respond. Both components 2 and 1 are now blocked.
- Component 3 responds to component 2.
- Component 2 responds to component 1, and the transaction is complete.
Ultimately, two worker threads (one in server 1 and one in server 2) were blocked during this transaction, and two remote method invocations were made. Each invocation introduced overhead and reduced server capacity.
Figure 8. Example of a customer transaction
The most expensive operations in terms of performance are usually the serialization and deserialization of the data and return values, and the encryption and decryption of the request and response. This overhead would be avoided if the services were collocated. Additionally, on platforms like z/OS, where work is billed according to the number of instructions executed (the MIPS costs mentioned earlier), incurring this overhead on every transaction can dramatically increase the cost of the deployment.
Queuing models can help ensure that the SOA is robust and production ready. With queuing models, you can both visually represent a system and vary parameters, such as timeouts and average response times, to understand their effects on the system. Figure 9 shows an example of a queuing model.
Figure 9. Example of a queuing model
The key here is the ability to vary both the probability of a timeout and the timeout value. You can then make calculations to determine the service rates for each thread (and subsequently the service rate for the entire server). A queuing model representing one customer's system was able to predict an EC3 abend within thirty seconds of it actually occurring.
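The calculation described above can be sketched as follows. This Python is a deliberately simplified stand-in, not the model shown in Figure 9: it assumes that each request either completes in a fixed normal time or waits for the full timeout, and all of the input values are hypothetical.

```python
def server_service_rate(threads, normal_seconds, timeout_seconds, timeout_probability):
    """Requests per second a server can sustain when some calls time out.

    Expected time per request is a weighted average of the normal response
    time and the full timeout, weighted by the probability of a timeout.
    """
    if not 0.0 <= timeout_probability <= 1.0:
        raise ValueError("timeout_probability must be between 0 and 1")
    expected_seconds = (
        (1.0 - timeout_probability) * normal_seconds
        + timeout_probability * timeout_seconds
    )
    per_thread_rate = 1.0 / expected_seconds
    return threads * per_thread_rate


# Hypothetical inputs: 40 threads, 0.5 s normal response, 120 s timeout.
for p in (0.0, 0.01, 0.10):
    print(p, round(server_service_rate(40, 0.5, 120.0, p), 2))
```

With these inputs, a timeout probability of only 1 percent cuts the sustainable rate from 80 to about 24 requests per second, which shows why both the probability and the timeout value are worth varying.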
Queuing models can help identify the effects of blocking worker threads in an SOA. But to define a queuing model, you have to understand the interdependencies among the components and subsystems. The construction of such models is the focus of a future article in this series.
This article illustrated how subtle—yet important—design issues can have significant effects on the stability of the SOA. Several specific, potentially problematic areas of design were identified that, when implemented without appropriate consideration, could increase the fragility of the SOA infrastructure and, thus, decrease the stability of the overall SOA deployment.
The purpose of the article was to introduce you, the SOA designer, to important factors that contribute, both positively and negatively, to the resiliency of the SOA. Future articles in the series will discuss solutions to the problems described here, including short-term, immediate solutions that help stabilize a problematic SOA and long-term, comprehensive solutions that build resilient SOAs.
Learn
- IBM Redbooks®: Read "Architecting High Availability Using WebSphere V6 on z/OS" for a great introduction to Web services and Remote Procedure Calls (RPCs).
- IBM Redbooks: "Monitoring WebSphere Application Performance on z/OS" provides an overview of WebSphere Application Server on z/OS.
- Check out the WebSphere SOA and J2EE in practice blog for Bobby Woolf's description of IBM WebSphere Extended Deployment.
- Read the IBM WebSphere Developer Technical Journal article, "The top 10 (more or less) J2EE best practices" (developerWorks, 2004).
- Check out the Recommended reading list: J2EE and WebSphere Application Server, the definitive overview of related WebSphere Application Server resources.
Snehal Antani works for the IBM Software Services for WebSphere (ISSW) group and is the technical lead for IBM WebSphere Extended Deployment. His focus is primarily in the domain of infrastructure design for SOA with WebSphere-branded products across all platforms (z/OS and Distributed). He comes from a development background where he helped develop and deliver WebSphere z/OS, WebSphere XD-Distributed, and WebSphere XD-z/OS. Snehal has disclosed several patents and technical publications in the domains of SOA, enterprise application infrastructure, and grid computing. He earned a BS in computer science from Purdue University and will complete his MS in computer science from Rensselaer Polytechnic Institute (RPI) in Troy, NY.
Rob Alderman works for the IBM WebSphere Application Server for z/OS development group. He is the technical team lead of the WebSphere for z/OS runtime development team. His development focus is primarily in the area of WebSphere for z/OS run time, where the WebSphere Application Server code interacts with and exploits the native operating system services available from z/OS. Rob earned a dual BS degree in computer systems engineering and computer science from Rensselaer Polytechnic Institute (RPI) in Troy, NY.