New Workloads and the Mainframe
Added by obriend, last edited by obriend on Mar 22, 2010  (view change)

Over the past 10 to 15 years companies have seen a major change in the kinds of work they process in their data centers. Workloads characterized as ERP (Enterprise Resource Planning) have been a major driver in growing processor capacity. These workloads include products from vendors such as SAP, PeopleSoft, and Siebel, to name a few. They include applications that support critical business functions like HR, financial analysis, customer fulfillment, CRM, and data mining. While these are not new applications for most companies, what is new is the way these applications are delivered and implemented. Instead of being homegrown and static, these applications are sold as packaged products that provide a core of functionality and let users customize the software to meet their individual needs.

The other attribute of these applications is their ability to let the end user construct dynamic transactions that produce customized reports on demand. The dynamic nature of these transactions is the major difference between how these "new" workloads run compared to the static nature of traditional workloads. Think of the difference between a CICS transaction that executes pretty much the same way every time it is called, versus a user created query that scans large data bases and produces a customized report. There are many differences in these kinds of workloads, and it is those differences that are fundamentally changing the role of enterprise computing.

These new workloads come in many flavors and architectures. In a single-tier architecture all of the layers (presentation layer, application layer, data base layer) run on a single box. In a multi-tier architecture they can be spread over multiple boxes as well as multiple platforms and operating systems. For large scale enterprise applications, the data base layer is increasingly placed on the mainframe. This trend is an important reason for the resurgence of the mainframe as the central hub for enterprise computing. There are lots of reasons for this choice but we'll leave that discussion for another time. The growth in these applications has not come without a price, however. That price is a number of unhappy customers and a renewed discussion on whether the mainframe can really deliver on its promise of lowest TCO.

Consider this scenario. A company embarks on a lengthy project to implement one of these new applications. They begin this project with a sizing and TCO analysis that indicates this will be a sound business investment with a 3-year payback. One year into the project their mainframe utilization spikes, end user response times tank, and they haven't begun to hit the projected volumes for full production. The result is predictable. Upper management begins to question the choice to use the mainframe and fingers start to point in all directions. If this sounds like an extreme example, it is not. In fact it is more common than you might think. The following discussion will explain what happened and try to provide some guidance for IT shops who are about to head down this road.

New Workloads Vs Old Workloads

As described earlier, new workloads tend to have a dynamic component which lets end users construct their own transactions for creating reports or other kinds of information retrieval. This functionality provides both a value and a risk. The value is the ability to quickly pull information out of the data base without having to wait for the IT department to code customized transactions. The risk is that these requests can be unstructured and very inefficient in how they run. The very attribute that justifies installing these new applications is also the reason for so much angst when resource consumption soars and response times degrade. But that does not mean these applications are unmanageable; rather, it means the traditional way that performance management and capacity planning are done must be re-examined.

Another characteristic of new workloads is that work requests are often submitted from other systems. Many requests use DDF (Distributed Data Facility) to transmit remote requests for data base access to z/OS. These requests are initially handled by a distributor program, then routed to another program where they execute under what are called Enclave SRBs (Service Request Blocks). These transactions inherit the WLM (Workload Manager) priority from the Service Class where they run. While this scheme has been around for several years, it is not well understood by the people who are asked to manage it. That often leads to WLM policies that do not provide the correct goals and importance targets for managing these workloads.

Performance Management and Capacity Planning

Performance Management is the set of processes which ensures that service levels are being met. The first step in managing performance is to define what constitutes an element of work, and what service level it should be managed to. When CICS was the dominant OLTP workload, this was a fairly straightforward process. Most CICS transactions were static, repeatable, and easy to understand. A common approach was to characterize CICS with the following attributes:

  • Total Resources Consumed (i.e. MIPS)
  • Total Transaction Rate (transactions per second)
  • Average Path Length (millions of instructions per transaction)
  • Average Response Time (total seconds per transaction)

Based on these attributes, IT shops were able to define SLAs at the CICS transaction level of detail. For example, an SLA might read "90% of all CICS transactions will complete in less than .75 seconds". One of the ways you manage CICS performance is by choosing appropriate goals in the WLM policy. For example, you can define an Execution Velocity or a response time goal in the policy and WLM will manage CICS towards those goals. The key lever used by WLM is to adjust dispatching priorities. But all of the transactions running in the CICS region run at the same dispatching priority, the priority set for the CICS region.

This scheme works well for CICS where most transactions are small consumers of CPU resources. It works less well for the new workloads where transactions can vary widely in their use of resources. For example, consider a new workload with a mix of trivial transactions and long running, CPU intensive transactions. If these all ran in the same Service Class, the trivial transactions would compete on equal terms with the long running ones. It only takes a few CPU intensive transactions to drive utilization to 100%, and once the system is constrained the result is poor performance for all transactions. The solution to this problem is readily available in z/OS: the use of multiple period Service Classes.
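The four attributes and the sample SLA above can be derived from per-transaction measurements. The following Python sketch is only an illustration: the Txn record, its field names, and both helper functions are hypothetical and do not correspond to any CICS, RMF, or SMF record layout.

```python
from dataclasses import dataclass


@dataclass
class Txn:
    # Hypothetical record; field names are assumptions for this sketch.
    cpu_mi: float      # millions of instructions executed by the transaction
    response_s: float  # end-to-end response time in seconds


def characterize(txns, interval_s):
    """Summarize a measurement interval using the four attributes listed above."""
    total_mi = sum(t.cpu_mi for t in txns)
    return {
        "mips_consumed": total_mi / interval_s,        # total resources consumed
        "txn_rate": len(txns) / interval_s,            # transactions per second
        "avg_path_length_mi": total_mi / len(txns),    # MI per transaction
        "avg_response_s": sum(t.response_s for t in txns) / len(txns),
    }


def meets_sla(txns, threshold_s=0.75, fraction=0.90):
    """True if at least `fraction` of transactions finish in under `threshold_s`."""
    within = sum(1 for t in txns if t.response_s < threshold_s)
    return within / len(txns) >= fraction
```

Note that an average response time alone cannot confirm a percentile-style SLA; the check has to count individual transactions, which is why meets_sla works on the raw records rather than on the summary.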

Multiple period Service Classes allow long running transactions to migrate from one period to the next. As a transaction consumes resources it can migrate from period 1 to period 2, and so on. Each successive period is usually assigned a lower level of importance in the WLM policy. Transactions that make their way to the lowest defined period are considered long running resource hogs, and are left to soak up the CPU cycles that remain after all higher priority work has run. This is not a new concept in computing. In fact this scheme has been used to manage TSO transactions for over 40 years. What is new is the fact that many early adopters of new workloads have forgotten how to manage this kind of work. Early users of PeopleSoft and SAP were likely to define one or two Service Classes, and let trivial transactions fight it out with those long running hogs.
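The migration idea can be sketched in a few lines of Python. This is a toy model, not WLM policy syntax: the Period class, the duration thresholds, and the importance labels are all illustrative assumptions, and durations are treated here as the amount of service a transaction may consume within a period before it moves to the next one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Period:
    # Service a transaction may consume in this period before migrating.
    # None marks the last period, which has no limit. Values are illustrative.
    duration: Optional[float]
    importance: str


def current_period(periods, consumed):
    """Return the 1-based period a transaction is in after consuming `consumed` units."""
    remaining = consumed
    for index, period in enumerate(periods, start=1):
        if period.duration is None or remaining < period.duration:
            return index
        remaining -= period.duration
    return len(periods)


# Hypothetical three-period Service Class: importance drops as work ages.
POLICY = [
    Period(duration=500, importance="high"),
    Period(duration=5000, importance="medium"),
    Period(duration=None, importance="discretionary"),
]

for consumed in (100, 600, 10_000):
    period = current_period(POLICY, consumed)
    print(consumed, "->", "period", period, POLICY[period - 1].importance)
```

A trivial transaction never leaves period 1, while a large query ages into the last period and only receives whatever cycles the more important periods leave unused.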

While new workloads forced IT shops to rethink how they did Performance Management, the impact on Capacity Planning was more insidious. Consider the scenario discussed earlier, in which the decision was made to deploy the data base tier (or application layer) on the mainframe. Soon after implementation, utilization levels spike to 100% and everyone panics. The concern is that the solution was not sized properly and the TCO calculation was grossly in error. This can be a traumatic event since people's careers could be on the line. What is happening here is confusion between high system utilization and the system being out of capacity. For many shops, capacity planning consists of tracking system utilization and defining a trigger point for declaring the system is out of capacity. For example, that trigger might be "the system needs to be upgraded when system utilization between 10 AM and 4 PM reaches 95%". There are lots of variants on this trigger, but the point is that it is based on utilization. Most shops, when pressed, will admit they don't do an upgrade until the phone rings and someone is complaining, and the person on the other end of the phone is someone they care about. The important point here is that modern mainframes can run at very high utilization. The key determinant of whether an upgrade is needed is not utilization; it is performance, and specifically the performance of important workloads.

Consider the following. A new workload runs on a z/OS system with 2 physical CPs. If this is a z10 EC class machine, each CP provides approximately 1,000 MIPS worth of processing capacity. That equates to 2,000 total MIPS, not a small machine by any standard. If a new workload generates 300 transactions per second, the utilization of the z10 by that workload is easily calculated by the following:

Utilization = average transaction path length x transaction rate / Total MIPS.

In this example, assume an average path length of 2 million instructions per transaction. Then the calculation would be 2 x 300 / 2000 = .30 or 30% busy. This is the way traditional sizings have been done and it works quite well for traditional CICS based applications. The problem introduced by new workloads is the large skew in transactions between the trivial and the long running. Consider that a new workload might have the same average path length as in the first example (2 million), but occasionally it dumps a long running query into the mix. Say this query needs to execute 100,000 times the average path length before it completes. On a single 1,000 MIPS CP, the total time it spends executing is 100,000 x 2 / 1,000 = 200 seconds. The exact math isn't important. What's important is that this transaction will be executing on a CP for a long time (200 seconds), and during that time it can drive a single CP to 100% busy (assuming it is CPU bound).

If two of these transactions started at the same time they would drive the entire processor (both CPs) to 100% busy and keep it there for 200 seconds. Now suppose that every 200 seconds two more of these transactions kicked off. You get the picture. It takes a very small number of these long running transactions to drive utilization to 100% busy and keep it there for an extended period of time.
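The arithmetic in this example is easy to reproduce. The short Python sketch below uses only the figures quoted in the text (2 CPs at roughly 1,000 MIPS each, 300 transactions per second, a path length of 2 million instructions); the function and variable names are made up for illustration.

```python
def utilization(path_length_mi: float, txn_rate: float, total_mips: float) -> float:
    """Utilization = average path length x transaction rate / total MIPS."""
    return path_length_mi * txn_rate / total_mips


def cpu_seconds(path_length_mi: float, mips_per_cp: float) -> float:
    """Seconds one CP spends executing a transaction of the given path length."""
    return path_length_mi / mips_per_cp


TOTAL_MIPS = 2000   # 2 CPs at roughly 1,000 MIPS each
MIPS_PER_CP = 1000
AVG_PATH_MI = 2     # millions of instructions per transaction

steady_state = utilization(AVG_PATH_MI, txn_rate=300, total_mips=TOTAL_MIPS)
long_query_s = cpu_seconds(100_000 * AVG_PATH_MI, MIPS_PER_CP)

print(f"steady-state utilization: {steady_state:.0%}")
print(f"long running query: {long_query_s:.0f} CPU seconds on one CP")
```

The contrast is the point: 300 ordinary transactions per second use under a third of the machine, while two concurrent long running queries occupy both CPs completely.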

New workloads and the dynamic query capabilities they provide are notorious for generating this kind of work. Instead of a relatively constant level of MIPS consumed by CICS we see huge spikes where utilization hits 100% and sometimes stays there for extended periods of time. And when tracked back to its source, we see a small number of transactions that caused the high consumption of MIPS. This is the phenomenon that is causing so much concern over the implementation of new work on mainframes today. The immediate reaction is that the system is saturated and needs to be upgraded. Of course this throws the capacity plan into disarray since we haven't begun to hit full production levels yet.

If this company does capacity planning based on utilization, then this might be a valid concern. But if they do capacity planning based on service delivery, and if they have classified their work into different levels of importance, then this is no big deal. Let's start by re-visiting the concept of multiple period Service Classes. If we look at a few days or weeks worth of RMF reports, we can see how these transactions consume resources. There will be an obvious skew showing most transactions complete in the first period and use few resources. A small number will complete in period 2, and a very small number will run to completion in period 3. A logical approach to assigning priorities is to assign highest priority to period 1 and lowest to period 3. This ensures we don't let trivial transactions compete with those long running hogs. But it also recognizes that we don't need to set unreasonable service levels for transactions that scan the entire data base, burn lots of CPU cycles, and run for hours. A logical approach is to treat these transactions as discretionary units of work and let them use whatever is left after important work runs. It also recognizes that seeing utilizations at or near 100% busy for extended periods of time is not a sign that the system is out of capacity.

The implications from this scenario and the impact on capacity planning are as follows. New workloads need to be understood both in terms of their total resource consumption and in terms of how much skew exists among the different transactions. This can be difficult in the beginning, especially given the way new transactions can be dynamically created. But over time, usage reports can be used to analyze this skew and find patterns. Next, service levels need to be set based on which transactions are important and which ones are not. A typical approach is to treat those very long running transactions as not worthy of any service level guarantee. Next, figure out how much capacity is needed to meet the service levels for the important work. It's usually easiest to take the total measured consumption and subtract the capacity used by the discretionary transactions. This ensures that capacity decisions are based on what's needed to service the important work and not the long running discretionary work. It also takes the spotlight off total utilization and puts it where it belongs.
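The subtraction step can be sketched as follows. The period names and MIPS figures are invented for illustration, and in practice the inputs would come from whatever usage reports the shop already collects; this is a rough outline of the idea rather than a capacity planning tool.

```python
def utilization_report(mips_by_period, discretionary, total_mips):
    """Compare total utilization with the utilization of important work only.

    mips_by_period: measured MIPS consumed per Service Class period (hypothetical data)
    discretionary:  names of periods treated as discretionary work
    total_mips:     capacity of the machine
    """
    total_used = sum(mips_by_period.values())
    important = sum(
        mips for name, mips in mips_by_period.items() if name not in discretionary
    )
    return {
        "total_utilization": total_used / total_mips,
        "important_utilization": important / total_mips,
    }


# Illustrative measurement: the machine looks saturated, but most of the
# consumption comes from long running period 3 work.
report = utilization_report(
    {"period1": 500, "period2": 300, "period3": 1200},
    discretionary={"period3"},
    total_mips=2000,
)
print(report)
```

In this made-up case the machine is 100% busy, yet the important work needs well under half of it, so an upgrade trigger based on total utilization would fire long before one based on the important work.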

Conclusions

Many new workload projects get off to rocky starts and call into question the choice of using the mainframe as the data base server. Most of these concerns are unfounded and can be traced back to poor practices in Performance Management and Capacity Planning. While new workloads are different from traditional OLTP workloads, they still obey the laws of computing and can be managed successfully on a mainframe. Many would argue the mainframe is best suited to run these workloads for a whole host of reasons. Yet many mainframe shops have lost critical skills in Performance Management and Capacity Planning over the last several years. The resurgence of the mainframe as the central hub for enterprise computing is shining a spotlight on this issue, and that is forcing many shops to re-invest in these skills or turn to business partners and consultants to fill the gaps.

Note: This article was first published by IBM Systems Magazine, Mainframe Edition.

About the Author

This article was written by Marty Deitch of Vicom Infinity, Inc., an IBM Premier Business Partner.

