This article presents the initial steps toward a distributed system that can optimize its performance by learning to reconfigure CPU and memory resources in reaction to current workload. We present a learning framework that uses standard system-monitoring tools to identify preferable configurations and their quantitative performance effects. The framework requires no instrumentation of the middleware or of the operating system. Using results from an implementation of the TPC Benchmark™ W (TPC-W) online transaction-processing benchmark, we demonstrate a significant performance benefit to reconfiguration in response to workload changes.
As an experimental testbed for our research, we constructed an implementation of the TPC-W benchmark (see Resources) using commonly available hardware and software. TPC-W provides a well-defined simulation of an online store that lets you vary the workload and evaluate the system under varying demands.
Our findings represent three important steps toward the eventual goal of constructing a fully adaptive online system:
- We establish that dynamically reconfiguring hardware in response to workload changes has the potential to improve performance.
- Using uninstrumented middleware and given only raw, low-level system statistics, we show that it's possible to predict which of two configurations will outperform the other at any given time.
- We extend this prediction capability to numerical estimates of how much performance will change when the system is switched to each alternative configuration. With such information, performance gains can be weighed against inevitable reconfiguration costs.
This article describes our experimental setup, details our work in establishing the importance of autonomous reconfiguration, reports our results, and discusses the potential impact of our work.
Large servers are now available that can be partitioned into one or more logical subsystems (see Resources). Each logical system has memory and processors available to it, so it can operate as if it were an independent physical machine. Letting each logical subsystem run its own instance of the operating system (OS) prevents them from interfering with one another through resource contention.
These servers can also be flexibly configured to allocate different amounts of memory and processing resources to the logical subsystems. But the allocation that maximizes performance for one set of workloads might be suboptimal for others. Real-time reconfiguration might be needed to maximize performance under a variable workload.
We simulated reconfiguration of logical subsystems on multiple desktop computers to perform our research. The remainder of this section details the testbed setup.
The TPC-W Benchmark is a standardized benchmark designed to determine the relative performance of a System Under Test (SUT) when used to run an online bookstore. The benchmark operates by having an external machine or set of machines -- the Remote Browser Emulators (RBEs) -- run a set of Emulated Browsers (EBs). The EBs represent the bookstore's individual customers. The customers can browse through the store, view products, perform searches, and sometimes place orders.
The TPC-W specification defines the relative probabilities of the customers' actions. It defines three workloads, called mixes:
- The shopping mix represents normal operation of the system. 80% of the accesses are users browsing the available catalog, and 20% of the accesses are orders being placed.
- The browsing mix represents a slow commerce period. 95% of the users are browsing, and only 5% are ordering.
- The ordering mix represents a rush on the latest hot book. Browsing and ordering users are evenly split.
A total of 14 Web pages can be retrieved. They are divided into six browsing pages and eight ordering pages. The probability of a customer moving from a given page to any other page is well defined by the specification, and each page has its own expected response time.
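To make the mixes concrete, here's a minimal sketch (class names are hypothetical; the real specification drives a full 14-page transition matrix with think times) of how an emulated browser might choose between browsing and ordering interactions under each mix:

```java
import java.util.Random;

/** Sketch: choosing browse vs. order per TPC-W mix (the spec uses a 14-page transition matrix). */
public class MixDemo {
    enum Mix {
        BROWSING(0.95), SHOPPING(0.80), ORDERING(0.50);

        final double browseProb;  // probability that the next interaction is a browsing page
        Mix(double p) { browseProb = p; }
    }

    public static void main(String[] args) {
        Random rng = new Random();
        for (Mix mix : Mix.values()) {
            long browses = 0, trials = 100_000;
            for (int i = 0; i < trials; i++) {
                if (rng.nextDouble() < mix.browseProb) browses++;  // sample one interaction
            }
            System.out.printf("%s: %.1f%% browsing interactions%n",
                              mix, 100.0 * browses / trials);
        }
    }
}
```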
Pages are generated dynamically in response to user queries, and some pages require significantly more processing than others. For example, because some of the ordering pages update the prices and stock in a highly used database, they consume more resources than simply pulling up the home page. Results are normally measured in Web Interactions per Second (WIPS) -- the average number of page requests that return in a second.
Commercially built, tested, and published systems often have one main database server and many independent Web servers. They also often have distinct Web cache servers, image servers, and load balancers. For simplicity, our setup uses one database server (the back end) and one Web server (the front end), as illustrated in Figure 1. The WIPS numbers our system produces, though three orders of magnitude less than those of a large commercial system, are not out of the ordinary for experimental systems.
Figure 1. The three machines used in the physical setup
In Figure 1, the thick, dashed rectangles represent physical machines. The dotted rectangles are processes, and the innermost rounded rectangles are logical units. Network connections are shown as lines and come together at the point where they are managed. Apache Tomcat (see Resources) handles routing of connections to the application and image servers, while the RBE machine coalesces the individual EBs' network connections.
A TPC-W implementation requires three software modules to support the SUT and drive the benchmark: a database server, an application server, and an image server. We use PostgreSQL 7.4.6 as the database server. The front end uses Apache Tomcat 5.5.4 as a combined application server and image server. The Java™ code the application server runs to generate the Web pages (and interface with the database) is derived from the code freely available from the University of Wisconsin PHARM project (see Resources). This code implements both the servlets and a Java RBE, which is used to run the benchmark. We made slight modifications to this code to let it work with Tomcat and PostgreSQL (see Resources). The RBE was modified to retry failed connections to the front end rather than treating them as fatal errors.
The system's physical setup uses three identical desktop workstations with a 2.8 GHz Pentium 4 processor and 2 GB of RAM each. The machines are networked using built-in gigabit Ethernet interfaces and a gigabit Ethernet switch. As you can see in Figure 1, one machine acts as the back-end database machine, one is the front-end Web server, and a third machine drives the benchmark by hosting the RBE.
These physically distinct computers are meant to represent logical partitions of a single reconfigurable computer with a total of 2.8 GHz processing power and 2 GB of RAM. Memory and CPU power are artificially constrained on each machine to simulate partitioning of one such machine into a front-end and back-end machine. Overall, one full 2.8 GHz processor and 2 GB of RAM are available for use by the front end and back end combined. By constraining CPU power and memory simultaneously, we can simulate any desired hardware configuration. The processes are also designed to be reconfigurable on the fly. We can simulate reconfiguring the system to give more memory or CPU power to one machine by first constraining the other to the new requirement, and then unconstraining the newly available memory or CPU power on the target machine.
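The article doesn't detail the constraint mechanism itself. As an illustration only, one way to soak a fixed fraction of a CPU on the fly, so that the benchmark processes see only the remainder, is a duty-cycle busy-loop like the following sketch (all names are hypothetical):

```java
/**
 * Illustrative sketch: soak a fraction of one CPU so the benchmark
 * processes see only the remainder. The fraction can be changed on the
 * fly to simulate moving CPU between logical partitions.
 */
public class CpuSoaker implements Runnable {
    private volatile double soakFraction;  // e.g. 0.875 leaves 1/8 of the CPU usable

    public CpuSoaker(double soakFraction) { this.soakFraction = soakFraction; }

    /** Reconfigure on the fly: lower this here before raising it on the other machine. */
    public void setSoakFraction(double f) { this.soakFraction = f; }

    @Override
    public void run() {
        final long periodMs = 100;  // a short period keeps the interference smooth
        while (true) {
            long busyMs = (long) (periodMs * soakFraction);
            long end = System.currentTimeMillis() + busyMs;
            while (System.currentTimeMillis() < end) { /* burn cycles */ }
            try {
                Thread.sleep(periodMs - busyMs);  // yield the rest of the period
            } catch (InterruptedException e) {
                return;
            }
        }
    }

    public static void main(String[] args) {
        new Thread(new CpuSoaker(0.875)).start();  // simulate giving 7/8 of this CPU away
    }
}
```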
It's not self-evident that adaptive configuration is necessary to maximize the throughput of a TPC-W system. Our first goal was to verify that the best configuration is a function of workload. We identified two workloads and two configurations such that when running workload x, configuration A gives better results, and when running workload y, configuration B gives better results (see Table 1).
Table 1. WIPS results for each configuration and workload
| |Configuration A|Configuration B|
|Workload x|q|r|
|Workload y|s|t|
In Table 1, q, r, s, and t are the WIPS results for the given configuration and workload.
For all workloads in TPC-W, the database does more processing than the Web server. Our initial experiments showed that the configurations that maximized throughput dedicated much of the CPU to the database back-end machine. Detailed experimentation on the shopping mix indicated that with about 7/8 of the CPU (about 2.5 GHz) on the back end, and the remaining 1/8 on the front end, the system is roughly in balance.
After high-level analysis of 15 workloads run on 15 configurations with roughly 7/8 of the CPU on the back end and various splits of memory, we identified five likely workloads and two configurations for further investigation. The first configuration (Configuration 1) had 27/32 of the CPU and 3/8 of the memory on the back end. Configuration 1 maximized the throughput of the workload with 200 EBs running the ordering mix, and workloads of 250 and 300 EBs running the shopping mix. The second configuration (Configuration 2) had 30/32 of the CPU and 5/8 of the memory on the back end. Configuration 2 led to higher performance for workloads of 350 and 400 EBs running the browsing mix (see Table 2).
Table 2. Mean WIPS for chosen configurations over a variety of workloads (standard deviations in parentheses)
|Configuration|200 ordering|250 shopping|300 shopping|350 browsing|400 browsing|
In order to establish conclusively that configuration matters in this case, we tested each configuration and workload independently 53 times. To eliminate any interference between tests, the database and servlet engine were started before and shut down after each test, and the database files were copied from originals. This prevented any growth in the databases, or servlet persistence issues, from having an impact on the results.
Analysis of the data confirmed that there were statistically significant differences between the two configurations, shown in Table 2. The probability that the differences were due to random chance was less than 10⁻¹⁰ in all cases. Note especially the large differences in WIPS between the two configurations with 250 and 300 EBs running the shopping mix, and the difference in WIPS with 400 EBs running the browsing mix. These large differences conclusively confirm that neither configuration is optimal in all situations. This result establishes the need for adaptive system configuration in order to take advantage of the optimal allocation of resources for the current workload.
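The article doesn't name the statistical test it used. As one plausible sketch, a Welch two-sample t-test over the per-run WIPS samples would produce the kind of p-value reported; here via Apache Commons Math (an assumption), with stand-in data:

```java
import org.apache.commons.math3.stat.inference.TTest;

/** Sketch: two-sample t-test on WIPS samples (the statistics tool is an assumption). */
public class SignificanceCheck {
    public static void main(String[] args) {
        // Hypothetical WIPS values standing in for the 53 runs per configuration
        double[] config1 = {14.9, 15.1, 15.0, 14.8, 15.2};
        double[] config2 = {13.1, 13.3, 12.9, 13.0, 13.2};
        double p = new TTest().tTest(config1, config2);  // two-tailed p-value
        System.out.println("p = " + p);  // the article reports p < 10^-10 in all cases
    }
}
```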
The results so far show that the system can improve its performance if it can adapt its configuration as the workload changes. But in practice the workload isn't an observable quantity to the system. This section presents results indicating that the optimal configuration can be determined from low-level OS statistics with no customized instrumentation. The lack of instrumentation lets this approach work for any TPC-W implementation, regardless of the software used to implement the database, Web server, and so on.
In addition to collecting WIPS results during experiments, the front-end and back-end machines also collect low-level system statistics. The statistics are collected in parallel on both machines through the vmstat command, a commonly available system activity reporting utility (see Table 3). In order to determine the currently optimal configuration, we aimed to create a model that maps the current system state, as represented by vmstat, to the optimal configuration. Using the results reported in Table 2 as training data, standard machine-learning methods can be used to learn such a model.
Table 3. Statistics reported by vmstat

|Processes|Runnable|In uninterruptible sleep|
|Memory (KB)|VM used|Idle|
|Swapping (KB/s)|Swapped in|Swapped out|
|I/O (blocks/s)|Received from devices|Sent to devices|
|System (per second)|Interrupts|Context switches|
|CPU (% of time)|User|Kernel|
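As a rough sketch of this collection step (assuming Linux's default vmstat column layout; the article doesn't show its scripts), each machine could sample vmstat periodically and parse out fields like those above:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

/** Sketch: sample vmstat and extract a few fields (assumes Linux's default columns). */
public class VmstatSampler {
    public static void main(String[] args) throws Exception {
        // "vmstat 5 2" prints the since-boot averages plus one 5-second sample;
        // we keep only the last line (the live sample).
        Process p = Runtime.getRuntime().exec(new String[] {"vmstat", "5", "2"});
        BufferedReader in = new BufferedReader(new InputStreamReader(p.getInputStream()));
        String line, last = null;
        while ((line = in.readLine()) != null) {
            last = line;
        }
        p.waitFor();
        // Default columns: r b swpd free buff cache si so bi bo in cs us sy id wa
        String[] f = last.trim().split("\\s+");
        System.out.println("runnable="   + f[0]);   // processes waiting for CPU (r)
        System.out.println("swapped-in=" + f[6]);   // KB/s swapped in (si)
        System.out.println("interrupts=" + f[10]);  // interrupts per second (in)
        System.out.println("ctx-switch=" + f[11]);  // context switches per second (cs)
    }
}
```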
The Weka package implements many machine-learning algorithms for exactly this purpose (see Resources). In order to obtain human-understandable output, we applied the JRip rule learner (see Resources) to the training data. As a baseline for analyzing the learned rules, accuracy can be compared to the model that always predicts the most likely outcome (Configuration 1 in this case). The accuracy of this baseline learner is 61.9%. By comparison, JRip learned the rules shown in Figure 2, yielding a prediction accuracy of 93.0%.
Figure 2. JRip rules learned by Weka
1. If (Number of front-end system interrupts <= 1392.8) and (Number of blocks sent by the front end to devices <= 201.2) then Best configuration = Configuration 2
2. If (Number of front-end context switches <= 422.0) and (Number of runnable processes on the back end >= 18.4) then Best configuration = Configuration 2
3. If (Number of blocks received by the front end from devices >= 499.5) then Best configuration = Configuration 2
4. If (Percentage of CPU time spent by the back end in the kernel <= 12.4%) then Best configuration = Configuration 2
5. Else Best configuration = Configuration 1
As desired, the rules learned by JRip are human-interpretable. Rules 1 through 4 indicate cases in which Configuration 2 is the preferred configuration, while Rule 5 classifies all remaining cases as preferring Configuration 1. These four rules can be divided into two sets: Rules 1 and 2 help identify situations where the front-end system has excess CPU available that the back end could use. Rules 3 and 4 determine that the back end is overutilized.
Of the first two rules, Rule 1 indicates a situation where the front-end system appears to be underutilized. Weka finds thresholds for the number of system interrupts taken by the front end and the number of blocks it sends to block devices (most likely the network sockets). If the front end falls under both thresholds, the back end is the bottleneck. Rule 2 uses a different signal -- the number of context switches per second -- to determine that the front end is underloaded. It also uses a threshold on the back-end machine to determine that more processes are runnable than the current configuration can handle, so additional CPU would help.
The third rule indicates that the back end is receiving blocks (both from the network and the disk) at a fast enough rate that it would benefit from more CPU and memory. Finally, the fourth rule indicates that the back end's CPU is spending very little time handling kernel-level work. This rule is less intuitive, but it likely captures a back end that doesn't have enough CPU to handle the necessary system-space work because it is trying to handle too much user-space code simultaneously.
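To reproduce this kind of experiment, the training step might look like the following sketch using Weka's Java API (the ARFF file name is hypothetical, and the article doesn't show its actual invocation):

```java
import weka.classifiers.rules.JRip;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LearnRules {
    public static void main(String[] args) throws Exception {
        // Hypothetical ARFF file: one row of averaged vmstat statistics per run,
        // labeled with the configuration that won (Configuration1 / Configuration2).
        Instances data = DataSource.read("tpcw-vmstat.arff");
        data.setClassIndex(data.numAttributes() - 1);  // class = best configuration

        JRip learner = new JRip();   // RIPPER rule learner, as used in the article
        learner.buildClassifier(data);
        System.out.println(learner); // prints human-readable if-then rules like Figure 2
    }
}
```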
Even more useful than predicting the best configuration is an ability to predict the actual benefit of switching configurations, in terms of increased (or decreased) WIPS. This ability is important because some cost is always involved in switching configurations. This cost needs to be taken into account when considering a configuration change.
For example, consider moving a CPU from one logical machine to another. Even when this move can be done without impacting the actual current work, the CPU will still be unavailable for a period of time. A similar problem applies to moving memory between machines. In a worst-case scenario, the machine might be temporarily unavailable while the system reallocates the hardware.
The Weka package includes an algorithm for M5P model trees that can learn function approximations (see Resources). We used this algorithm to predict the changes in WIPS when changing configuration. As before, the learner had access to only the averaged raw system-level data from both systems and the current configuration. The change in WIPS was defined to be 0 for a change to the current configuration. Accuracy of the M5P trees was compared to a baseline learner that always predicted a constant change in WIPS equal to the overall average.
Statistical analysis of the results indicates a sizeable improvement over the baseline learner. The complete M5P trees are too complex to display, but the small degree of prediction error (see Table 4) shows that the change in WIPS is a learnable function, without the need for internal instrumentation in the middleware.
Table 4. Error results of predicting the change in WIPS from one configuration to another
|Target|Learner|Mean abs. error|Root mean sq. error|
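A sketch of how such error figures might be obtained with Weka's M5P implementation follows (the file name and cross-validation protocol are assumptions, not the article's exact procedure):

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.M5P;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LearnDeltaWips {
    public static void main(String[] args) throws Exception {
        // Hypothetical ARFF file: averaged vmstat features plus the current
        // configuration, with the numeric class = change in WIPS after switching
        // (defined as 0 for a "switch" to the current configuration).
        Instances data = DataSource.read("tpcw-delta-wips.arff");
        data.setClassIndex(data.numAttributes() - 1);

        M5P tree = new M5P();  // model tree: regression with linear models at the leaves
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new java.util.Random(1));  // 10-fold CV
        System.out.println("Mean abs. error:     " + eval.meanAbsoluteError());
        System.out.println("Root mean sq. error: " + eval.rootMeanSquaredError());
    }
}
```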
The work this article reports is a step toward building a self-configuring system. An implementation of this approach on real configurable hardware remains to be done, but it is possible to analyze its potential impact. For example, consider a workload that alternates evenly between 300 EBs running the shopping mix and 400 EBs running the browsing mix. The data in Table 2 indicates that Configuration 1 would have an approximate throughput of 14.34 WIPS, while Configuration 2 would have an approximate throughput of 13.98 WIPS. By comparison, a system running adaptively, using the JRip learner, would have a throughput of 14.69 WIPS, if switching time is negligible. This computation assumes that 93.0% of the time, the winning WIPS value is attained, and 7.0% of the time, the losing WIPS value is attained. These numbers represent a 2.4% performance gain over Configuration 1 and a 5.1% gain over Configuration 2. As the disparities between the configurations grow (for example, if the differences were 5 WIPS instead of 1), this gain will also grow.
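In symbols, the adaptive throughput computed above is the accuracy-weighted average over the two workloads, where W_i(w) denotes the mean WIPS of Configuration i on workload w:

```latex
\mathrm{WIPS}_{\mathrm{adaptive}}
  \approx \frac{1}{2} \sum_{w \in \{300\,\mathrm{shopping},\; 400\,\mathrm{browsing}\}}
    \bigl[\, 0.93 \max_i W_i(w) \;+\; 0.07 \min_i W_i(w) \,\bigr]
```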
Our work makes two main contributions. First, it demonstrates that there is a need for self-configuring systems. Some systems might have a single division of resources that maximizes throughput under any workload, but we have shown that the optimal division can be workload-dependent. Because of this dependence on the workload, the learners we have outlined here appear to have a real use in distributed systems.
Second, having learners able to predict the change in WIPS is a critical step toward a fully functional self-configuring system able to maximize performance on the TPC-W benchmark. Because the cost involved in changing configurations can vary, this prediction allows a threshold to be set that would control when enough benefit would result from a configuration change to overcome the temporary cost of moving resources around.
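As an illustration, such a threshold might take the form of a simple cost-benefit rule like the following sketch (all parameters are hypothetical; the article does not specify a policy):

```java
/** Illustrative decision rule: switch only when the predicted gain outweighs the cost. */
public class SwitchPolicy {
    /**
     * @param predictedDeltaWips  the learner's predicted change in WIPS after switching
     * @param switchDowntimeSec   seconds of degraded service while resources move
     * @param currentWips         current throughput (interactions lost during the move)
     * @param horizonSec          how long the new workload is expected to persist
     */
    static boolean shouldSwitch(double predictedDeltaWips, double switchDowntimeSec,
                                double currentWips, double horizonSec) {
        double expectedGain = predictedDeltaWips * horizonSec;  // extra interactions served
        double expectedCost = currentWips * switchDowntimeSec;  // interactions lost moving
        return expectedGain > expectedCost;
    }
}
```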
A key feature of this work is that no instrumentation of any code is necessary. All learning was done based upon easily available raw system statistics. Because no middleware instrumentation is necessary, it is possible to change any or all parts of the system, including the front-end and back-end software, the TPC-W implementation, or even the OS or physical hardware. Retraining would be necessary, but no additional code modifications would be needed. Finally, the entire system could be replaced with another benchmark, such as Sun's Java Pet Store Web-based store simulation (see Resources). In contrast, most other adaptive systems appear to need instrumentation built into the applications and OS.
The rapid development of reconfigurable servers indicates that they will become more commonly used. As this hardware is deployed, it can be used for distributed applications where isolating the front and back ends is desirable, but where extra hardware is unneeded. In these cases, the server will need some form of adaptability to deal with changing workloads.
We've presented preliminary research into handling variable workloads by dynamically reallocating hardware resources among machines. In addition to showing that autonomous reconfiguration has the potential to improve performance, we presented two learners that predict preferable configuration changes with high accuracy. Our ongoing research agenda includes fully automating the adaptive process under dynamically changing workloads, first on our simulated reconfigurable machines and eventually on true reconfigurable hardware.
- Towards Self-Configuring Hardware for Distributed Computer Systems by Jonathan Wildstrom, Peter Stone, Emmett Witchel, Raymond J. Mooney, and Mike Dahlin: Original paper from which this article was adapted, published in Proceedings of the Second International Conference on Autonomic Computing, 2005.
- LPAR Heterogeneous Workloads on the IBM eServer pSeries 690 System: A paper describing the combination of a variety of significant workloads onto a single large system in multiple independent logical partitions (LPARs).
- Getting Java TPC-W to work with Postgresql and Tomcat: Information for getting the PHARM Project's TPC-W implementation to work with the Tomcat servlet container and PostgreSQL database.
- Data Mining: Practical machine learning tools and techniques, 2nd Edition by Ian H. Witten and Eibe Frank (Morgan Kaufmann, 2005): Definitive text on Weka and data mining.
- Fast Effective Rule Induction by William W. Cohen: A paper that evaluates the IREP rule-learning algorithm on a large and diverse collection of benchmark problems and proposes RIPPER, an optimized successor; Weka's JRip is an implementation of RIPPER.
- Inducing model trees for predicting continuous classes by Yong Wang and Ian H. Witten: Information on the Weka package's algorithm for M5P model trees.
Get products and technologies
- TPC-W: TPC Benchmark™ W (TPC-W) is a transactional Web benchmark that measures system performance in a controlled Internet commerce environment that simulates the activities of a business-oriented transactional Web server.
- Java TPC-W Implementation Distribution: A publicly available implementation of the TPC-W benchmark, written completely in the Java programming language, from the PHARM Project at the University of Wisconsin.
- Weka: A collection of machine-learning algorithms for data-mining tasks.
- Java Pet Store demo: A sample application demonstrating how to use the capabilities of the J2EE 1.3 platform to develop scalable, cross-platform enterprise applications.
Jonathan Wildstrom is a third-year Ph.D. student at the University of Texas at Austin. He completed his undergraduate work in Computer Science and Discrete Mathematics at Carnegie Mellon University in 1999. Immediately following that, he began work at IBM on the AIX operating system kernel, where he focused on process management and memory allocators. His current research focuses on self-optimizing distributed systems and autonomous reallocation of resources.
Dr. Peter Stone is an Alfred P. Sloan Research Fellow and Assistant Professor in the Department of Computer Sciences at the University of Texas at Austin. He received his Ph.D. in Computer Science in 1998 from Carnegie Mellon University. From 1999 to 2002 he was a Senior Technical Staff Member in the Artificial Intelligence Principles Research Department at AT&T Labs - Research. Peter's research interests include machine learning, multiagent systems, robotics, and e-commerce. In 2003, he won a CAREER award from the National Science Foundation for his research on learning agents in dynamic, collaborative, and adversarial multiagent environments. In 2004, he was named an ONR Young Investigator for his research on machine learning on physical robots.
Emmett Witchel is an Assistant Professor in computer science at the University of Texas at Austin. His award-winning doctoral dissertation at MIT was on Mondriaan Memory Protection, an efficient, fine-grained memory-protection system. While at MIT he published work on reducing energy consumption in caches and on low-power instruction sets. In 1997 he co-founded a company that developed a multiplatform static instrumentation technology for efficient monitoring of program control flow. Before MIT he published several papers as part of Stanford University's SimOS project, including Embra, still the fastest reported full machine simulator. He is interested in computer architecture and how the architecture is used by operating systems and compilers, and in applying machine learning to systems.
Raymond J. Mooney is a Professor in the Department of Computer Sciences at the University of Texas at Austin. He received his Ph.D. in 1988 from the University of Illinois at Urbana-Champaign. He is an author of more than 100 published research papers, primarily in the area of machine learning. He is program co-chair for the 2006 National Conference on Artificial Intelligence, a former co-chair of the 1990 International Conference on Machine Learning, a former editor of the Machine Learning journal, and a Fellow of the American Association for Artificial Intelligence. His recent research has focused on learning for natural-language processing, text mining, relational learning, semi-supervised learning, bioinformatics, and autonomic computing.
Mike Dahlin received his Ph.D. in Computer Science from the University of California at Berkeley in 1995 under the supervision of professors Tom Anderson and Dave Patterson, and he joined the faculty at the University of Texas at Austin in 1996. He received the NSF CAREER award in 1998, the Alfred P. Sloan Research Fellowship in 2000, a departmental Faculty Fellowship in Computer Science in 1999, a Cisco University Research Award in 2000, and IBM Faculty Partnership Awards in 2000, 2001, 2002, and 2004. His research interests include Internet- and large-scale services, fault tolerance and security, operating systems, distributed systems, and filesystems.