Virtualization is a current buzzword in the computer industry. And, as with any buzzword, you should approach it with caution. In addition to the benefits that virtualization provides, inherent pitfalls can accompany any new technology. This article demonstrates that it is possible to implement Lotus Domino in a virtualized environment (VM) with a Linux guest and to achieve a high level of scalability in your production environment. We discuss recent benchmark results and a real-life production example that demonstrates large Lotus Domino scalability in a virtualized Linux environment. We show the results of a recent benchmark where we were able to achieve 102 K Lotus Domino NRPC mail benchmark users in one Linux kernel running under VM.
We also discuss the capabilities of some of the various virtualized environments in the market today and the benefits that they can bring to Lotus Domino. By exploiting these new capabilities, you have the potential to re-architect your Lotus Domino environment and to generate additional cost savings today.
Virtualization and Lotus Domino: What is the benefit?
The benefits that you receive running a virtualized environment depend on the virtualization technology that you choose to deploy. Not all virtualization technologies have the same capabilities or provide you with the same benefits. By understanding the capabilities and benefits of your specific virtualized environment you can avoid the inherent limitations that each of these virtualization technologies possesses.
For example, in today's environment, if you are running Lotus Domino on dedicated hardware and one of your servers runs out of resources, what must you do? First you must buy, order, and wait to install your new hardware. Then you must buy and install a new license of Lotus Domino. Then you must build and configure a new Lotus Domino server. Then you must migrate some of the workload off the server that is resource-constrained to the new server. In a virtual world, you would only need to enable spare capacity or add more capacity to your existing hardware and reconfigure the virtual images for the new resources; that's all. There is no need to manage and manipulate your Lotus Domino servers if your resources are the limit. Virtualization allows you to have your infrastructure fit the needs of your Lotus Domino environment, not to have your Lotus Domino environment fit into your infrastructure limits.
Another advantage of the virtual environment is the ability to reduce costs by performing server consolidation and infrastructure reductions. Because you can dynamically extend new resources in the virtual world, you can architect a Lotus Domino environment that exploits these capabilities. For example, by running multiple Lotus Domino servers in a single Linux image as a guest under VM, you can eliminate the requirement to have physical hardware connections between the servers. Depending on the level of consolidation, you can also reduce or eliminate the current supporting infrastructure around the physical servers.
By reducing the complexity of your Lotus Domino environment, you can reduce the number of servers you need to run in your environment. This reduction in complexity directly relates to a reduction in the cost to administer (that is, to install and upgrade) and support (that is, performance, capacity, problem determination processes) your environment. As we show, thinking outside the box and not just porting existing images to your virtual world can lead to substantial reductions in the cost to run your Lotus Domino infrastructure.
Figure 1 shows an example of moving from a distributed physical Lotus Domino infrastructure to a centralized virtual Lotus Domino infrastructure.
Figure 1. Virtualization deployment example
In this example, we moved from 18 hardware systems, operating system images, and Lotus Domino mail servers to six Lotus Domino mail servers in two Linux guests under one VM.
This implementation assumes that your chosen VM and hardware have the required capacity to support this workload. Also, because the Linux guests run in the same physical footprint, they can exploit virtual LAN (VLAN) technologies. Not only does this ability remove the need for a hub server, but VLAN allows the Lotus Domino servers to communicate with each other at memory speeds, not LAN speeds. This ability also removes the need for a dedicated backbone to handle the server-to-server traffic (mail, cluster, replication, administration) between the servers.
Planes, trains, and automobiles: Not all virtualization is created equally
Just as the title of this section suggests, there are different ways to provide transportation, and they each have unique features such as speed, capacity, and distance. There are also many different ways that virtualization is performed today. Some virtualization is done in hardware, some is done in firmware, and some is done in software. Some virtualization is done using a combination of these approaches. The configuration depends on the virtualization products and the supporting software, firmware, and hardware that you have chosen.
Not all these virtualization technologies, though, address the same business needs or provide the same benefits. Depending on the virtualization technology and how it is implemented, the performance characteristics that the specific virtualization technology provides can vary. The greater the overhead of a specific virtualization technology, the more resources it requires to run that implementation.
Virtualization is not a new concept; it has been around for more than 40 years. Virtualization was first introduced as a product in 1967 with CP-67. The vendors providing virtualization have different histories and experiences, with a lot of recent activity in the last few years. IBM has more than 40 years' experience with improvements and performance enhancements in virtual environments.
Different aspects of virtualization can have different costs associated with them. For example, the ability to virtualize threads on a processor can have a different cost associated with it than the cost to virtualize the I/O. This fact means that you must look at all the costs of your virtual environment, including processor, memory, and I/O, to understand the total cost of running that virtual environment. Because Lotus Domino can have a significant impact on processors, memory, and I/O depending on what features and functions you enable, you must understand how these components work to appreciate the costs of virtualization with Lotus Domino.
As figure 2 shows, there are different approaches to the basic server implementation of virtualization, including hardware partitioning, bare-metal hypervisor, and hosted hypervisor.
Figure 2. Basic server virtualization approaches
It is also important that you understand how your virtualization technology implements its hypervisor. As figure 3 demonstrates, the various implementations have their benefits and issues.
Figure 3. Hypervisor implementation methods
Knowing how your virtualization is implemented helps you to exploit its capabilities. The IBM Red Paper entitled "IBM Systems Virtualization: Servers, Storage, and Software" published in April 2008 (REDP-4396) provides a good, detailed description of the different virtualization technologies and their benefits and issues.
102 K Lotus Domino benchmark users in one Linux kernel under VM
In the fourth quarter of 2008, a native 64-bit version of Lotus Domino was delivered for Linux. A benchmark was set up with the following two objectives:
- We wanted to see the impact to Lotus Domino’s scalability in a single Lotus Domino partition server when using the new native 64-bit code compared to the older 32-bit code.
- We wanted to see how far we could push one virtualized Linux kernel in a 64-bit operating system environment running Lotus Domino under VM.
The benchmark was performed at the Washington System Center in Gaithersburg, Maryland. For this benchmark, a z10 EC 2097-764 system was used. The system had 64 general processors (not including the I/O processors, spares, and so on) and 1.52 TB of memory. Only a portion of this system was allocated for this benchmark. A logical partition, or LPAR, was built on this system with an initial configuration of four processors and 48 GB of memory. In this LPAR, we built a VM operating system image and then built one Linux kernel under VM, with access to the four engines and 10 GB of memory. We ran our initial single Lotus Domino partition server tests in this configuration of four central processors (CPs, or engines) and 10 GB of memory.
Note that this system was not dedicated to this Lotus Domino benchmark but was running other workloads at the same time. In addition to the Lotus Domino benchmark, there were two other customer benchmarks running at the same time on the same physical hardware. One of these benchmarks was for a customer in the insurance industry; the other benchmark was for a customer in the banking industry. Each of these benchmarks had its own LPAR and virtualization within the same physical hardware. In addition to these workloads, several other, smaller LPARs were running on this system as well.
The total direct access storage device (DASD) attached to this physical system, through its various fiber channels, was well over 200 TB. For the Lotus Domino workload, we ended up with about 20 TB of DASD attached to this one Linux kernel, defined as extended count key data (ECKD) devices. Many times the question arises concerning whether to use ECKD DASD or Fibre Channel Protocol (FCP)-attached SCSI logical unit numbers (LUNs). There are pros and cons to each choice. The most efficient use of FCP-attached SCSI LUNs in the Linux environment is through dedicated FCP subchannels instead of using z/VM® emulated devices.
In this case, the Linux SCSI stack is used to manage, read, and write data on the target LUNs. The disadvantage to this configuration is that the target disk devices are not visible from the Conversational Monitor System (CMS) environment, so z/VM tools and processes cannot be used to assist with managing these devices. They are visible only from within the Linux environment, and thus they must be managed from within that context. Performance studies conducted by the IBM Boeblingen Lab suggest that using dedicated FCP subchannels and SCSI LUNs provides the best performance in terms of megabytes transferred per second. This solution also tends to incur the lowest processor overhead from a z/VM perspective because it is possible to use hardware qioassist (available on z9® and z10) since the I/O process is based on qdio instead of start subchannel. For a more detailed description of qdio, refer to the document “Queued Direct I/O (QDIO),” in the reference section.
The downside of this solution is that the System z® channel subsystem does not automatically provide multiple-path support to the target storage devices. All multipathing must be done by manually defining a multipath configuration within Linux and using device mapper support for the multipath configuration.
The advantage of using ECKD DASD devices is that they are visible from the z/VM CMS environment and thus can be easily managed from a typical z/VM system programming environment. An additional benefit is that management of multiple paths to the target storage subsystem is contained entirely in the hardware channel subsystem and does not require any manual intervention from the Linux side. The disadvantage is that the System z I/O architecture specifies that only one I/O operation can be active at any time on a specific target device (UCB or DASD subchannel). You can overcome this restriction by using parallel access volumes (PAVs). With the new support available in the latest versions of the Linux DASD driver and the DS8300 hyperpav feature, you can effectively use PAV with Linux on System z and achieve throughput rates that are similar to those of the dedicated FCP environment described earlier.
Our environment consisted of ECKD devices allocated as 3390 model 27 devices. We used ECKD devices because they were readily available whereas an FCP SCSI environment was not.
For the first set of tests, we gave Lotus Domino four CPs and 10 GB of memory to establish the baseline for running a single Lotus Domino server in a Linux guest. In this series of tests, we pushed this Lotus Domino instance as far as it would go before causing it to crash. This approach allowed us to scale up this Lotus Domino instance and see how far it could go with unlimited resources. Because we wanted to compare our results with previous Lotus Domino release benchmarks, we reconstructed their environments, configurations, and workloads. The basic definition for this benchmark was as follows:
- No transaction logging
- R6Mail workloads
- One Lotus Domino server in one Linux guest under VM
The R6Mail workload simulates mail and calendar operations sent from the Lotus Notes client to the server. It sends 4.67 messages per user per hour. For more information about this workload, refer to the developerWorks® article, "The new Domino 6 Notesbench workloads: Heavier by request!".
As figure 4 shows, we were able to reach 19 K to 20 K benchmark users in a single Lotus Domino instance before it failed.
Figure 4. Connected/active 15-minute users during a single Domino partition (DPAR) server test
For these measurements, we measured CP seconds used instead of processor busy rates as shown in figure 5.
Figure 5. CP seconds used during a single DPAR test
The processor busy rate is a calculation of the resource used (CP seconds) divided by the resource available. Historically, this calculation had a random numerator (CP seconds used) divided by a static denominator (physical processor resource available). Because half of this equation was static and only changed when a unit was upgraded or swapped out, the processor busy number was a good indication of the resources being used. In the virtual world, though, the denominator can now be a random value as well.
For more than five years now, the number of processors in a given virtual environment has been able to change every 10 seconds in large enterprise hardware. On these configurations, the number of processors available in the measurements is now a fractional number because in any one-minute interval it can change multiple times. For example, the number of processors in one minute could be 4.5 and then be 5.166 the next minute. This variation means that the processor busy rate now has a random denominator and a random numerator, which renders this value fairly useless in a production virtual world. By measuring CP seconds, you know the amount of processor resources that the environment used, but not how much is available in a virtual environment that can constantly change.
In our measurements, we used a 10-minute collection interval. This interval means that at 600 CP seconds (10 minutes x 60 seconds), you use the equivalent of one full engine of processor capacity.
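The conversion from CP seconds to engine equivalents can be sketched in a few lines of Python. This is a minimal illustration, assuming the 10-minute collection interval described above; the function name engines_used and the second sample value are hypothetical and are not taken from the benchmark data.

```python
def engines_used(cp_seconds: float, interval_minutes: float = 10.0) -> float:
    """Convert CP seconds consumed in one collection interval into the
    equivalent number of fully busy engines."""
    interval_seconds = interval_minutes * 60.0
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    return cp_seconds / interval_seconds


if __name__ == "__main__":
    # 600 CP seconds in a 10-minute interval equals one full engine.
    print(engines_used(600))
    # Hypothetical sample: 2,700 CP seconds in the same interval.
    print(engines_used(2700))
```

Because the result depends only on the CP seconds consumed and the length of the interval, it stays meaningful even when the number of processors available to the guest changes during the interval.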
The other drawback in the virtual world is that the processor busy rate can be misleading about how much capacity is left. If you have a system with two physical processors and you define two operating system images each with two logical processors, this configuration gives you a total of four logical processors. If both of these operating system images are treated and managed equally by your virtualization technology, then when both operating system images are at 50 percent processor busy rate, then you are 100 percent busy for the physical system and out of capacity. By understanding the amount of resources (CP seconds) you are using, instead of a percentage of a virtual capacity number, you can be in a better position to manage and monitor your environment.
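The two-image example above can be worked through as a short Python sketch. It assumes, as the example does, that each image reports its busy rate relative to its own logical processors and that the virtualization layer treats the images equally; the function name physical_busy is hypothetical.

```python
from typing import Sequence


def physical_busy(image_busy: Sequence[float],
                  logical_cpus: Sequence[int],
                  physical_cpus: int) -> float:
    """Return the fraction of physical processor capacity in use, given
    each image's busy rate (0.0 to 1.0) and its logical processor count."""
    if len(image_busy) != len(logical_cpus):
        raise ValueError("need one busy rate per image")
    if physical_cpus <= 0:
        raise ValueError("physical_cpus must be positive")
    # Demand expressed in processor equivalents, summed over all images.
    demand = sum(busy * cpus for busy, cpus in zip(image_busy, logical_cpus))
    return demand / physical_cpus


if __name__ == "__main__":
    # Two images, two logical processors each, both 50 percent busy,
    # sharing two physical processors.
    print(physical_busy([0.5, 0.5], [2, 2], 2))
```

In this sketch each image looks half idle when viewed on its own, yet the combined demand equals the full physical capacity.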
In addition to evaluating the resources used, we also looked at the cost of running the benchmark users. Figure 6 shows the processor cost for Lotus Domino active 15-minute users. This value is obtained by dividing the number of processor seconds used by the Lotus Domino server instance by the number of active users for each sample period.
Figure 6. Processor cost per active 15-minute user
Figure 6 shows an interesting but expected trend. The highest cost per user occurs at the start of the benchmark. Because users are logging in and validating their authorization, the processing overhead is associated with this elevated cost. For example, if a server is restarted in the middle of a busy shift, you need to plan for additional processor resources as your entire user population logs back into this server. The actual steady-state cost is lower after users have authenticated with the server and established their communication sessions.
For this reason, benchmarks tend to scale back the timing for loading the last set of users if they are pushing the processor to its limit and to allow some time before measuring the steady state. Additionally, you can see that the cost per user spiked at the tail end of this run before the server failed, indicating that the server was starting to come under stress even before its failure. Because this test was a single DPAR run with all mail being delivered locally, we understood that the cost per user would go up when we went to multiple DPARs and had both local and remote mail to deliver.
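The cost-per-user value plotted in figure 6 is a simple ratio, sketched below in Python. The sample intervals are hypothetical values chosen only to show a higher cost during ramp-up followed by a lower steady-state cost; they are not the measured benchmark data.

```python
def cost_per_active_user(cp_seconds: float, active_users: int) -> float:
    """Processor seconds consumed per active user in one sample period."""
    if active_users <= 0:
        raise ValueError("active_users must be positive")
    return cp_seconds / active_users


# Hypothetical samples: (CP seconds used, active 15-minute users).
samples = [
    (500.0, 5_000),    # ramp-up: users logging in and authenticating
    (700.0, 10_000),
    (1_190.0, 17_000), # steady state
]

if __name__ == "__main__":
    for cp_seconds, users in samples:
        cost = cost_per_active_user(cp_seconds, users)
        print(f"{users:>6} users: {cost:.3f} CP seconds per user")
```

Tracking this ratio per sample period, instead of as a single average over the run, is what makes the start-up and pre-failure spikes visible.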
One of our objectives in using this benchmark was to see if we could surpass 100 K benchmark users in one Linux guest. Based on the single Lotus Domino partition server tests, we calculated that we would need to run 17 K benchmark users in each of six Lotus Domino partition servers in one Linux guest. We added five additional Lotus Domino partition servers to bring this one Linux guest up to six Lotus Domino partition servers. Also, we added four more CPs, increased the memory to 26 GB, and added more DASD to support the full user load.
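The sizing arithmetic behind this plan can be checked with a short Python sketch. The figures come from the text (a 100 K target, six partition servers, 17 K users each, and a single-server limit of roughly 19 K to 20 K users); the function name plan and the choice of 19,000 as the conservative limit are assumptions made for illustration.

```python
def plan(target_users: int, partitions: int,
         users_per_partition: int, observed_limit: int) -> dict:
    """Summarize whether a per-partition load reaches the target and how
    much headroom remains below the observed single-server limit."""
    total = partitions * users_per_partition
    return {
        "total_users": total,
        "meets_target": total >= target_users,
        "headroom_per_partition": observed_limit - users_per_partition,
    }


if __name__ == "__main__":
    # Six DPARs at 17 K users each, against a 100 K target.
    print(plan(100_000, 6, 17_000, 19_000))
```

Running each partition below the point at which the single server failed leaves some margin for the extra cost of remote mail delivery between partitions.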
Each DPAR was given its own notesdata directory and four unique mail directories under the /notesdatax/mail path. We distributed the DASD for the user mail boxes over these mount points. Each Lotus Domino partition server was defined to run under a separate user ID and was given a unique IP address. Because all of these Lotus Domino partition servers resided on the same physical footprint, we were able to utilize VLANs and transfer data between the Lotus Domino partition servers at memory speeds (TCP/IP buffer-to-TCP/IP buffer transfer) without having to send the server-to-server traffic out to a physical backbone network.
Our first 100 K test was defined to ramp up to 60 K at the same rate as the single server test, then to slow the rate at which users were added. By 50 K users, however, we started to observe issues with the run. Mail was starting to back up, and response times on the clients were starting to lengthen. Based on analysis of the Lotus Domino and platform data, we detected an I/O bottleneck. While the DS8000® was delivering response times of less than 2 ms and VM was delivering response times of less than 2 ms, we observed response times of more than 80 ms out of the Linux kernel.
While we were investigating this issue, we tried several tests to see if we could get around this I/O bottleneck. During various test runs we did the following:
- Reduced the amount of data being written to log.nsf
- Increased the Linux guest memory to 48 GB
- Increased the NSF_Buffer_Pool size
- Moved mail.box(es) (each DPAR had six) to a RAM disk
- Moved names.nsf to a RAM disk
- Moved log.nsf to a RAM disk
- Moved mail.box(es), log.nsf, names.nsf to a RAM disk
- Changed the number of mail.box(es)
- Changed the number of threads in the server task
- Changed the number of threads in the router task for local and remote mail delivery
- Upgraded the Lotus Domino server to the gold-level GA code
- Set MailLeaveSessionsOpen=1
Although none of these changes made any difference in the I/O bottleneck and scalability issue that we had encountered, setting the notes.ini parameter MailLeaveSessionsOpen to 1 did make a measurable difference on the processor resources being used. We saw the processor cost per active 15-minute user go from .07 to .05 processor seconds, which means a reduction of about 28 percent to run the same workloads. This parameter tells Lotus Domino to leave the session for delivering mail open between the various servers. Prior to setting this parameter, Lotus Domino would establish a session to a mail server to which it was transferring a message, authenticate, send the message, and then drop the session. By leaving the session open, Lotus Domino could reuse the existing sessions instead of constantly reestablishing them.
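The size of this reduction can be checked from the two rounded cost figures quoted above. The Python sketch below uses only those published values; because the unrounded measurements are not given here, the result is an approximation that lands between 28 and 29 percent.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage reduction when a cost drops from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


if __name__ == "__main__":
    # Processor seconds per active 15-minute user, before and after
    # setting MailLeaveSessionsOpen=1 (rounded values from the text).
    print(f"{percent_reduction(0.07, 0.05):.1f} percent")
```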
Eventually, we determined that the I/O issue was caused by the way the ECKD volumes were being supported in Linux and how requests were being queued at the logical device level. A fix for parallel access volumes (PAVs) is being delivered in SLES 11, but that fix was not available for us to use at the time of the benchmark. We reconfigured the I/O subsystem to divide the volumes into smaller logical 3 GB drives, thereby creating more logical drives for the same total amount of DASD. This configuration provided us with more logical drives for Linux so that we could spread the I/O out to the same DS8000 systems.
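A rough way to see why more logical drives help is to treat each logical device as able to service only one I/O at a time, as described earlier for ECKD devices without PAV. The Python sketch below is an idealized lower bound under that assumption: it ignores caching, queueing effects, and uneven distribution of I/O across devices. It uses the 2 ms response time and the peak of just under 27 K I/Os per second reported for this benchmark. The function name min_devices is hypothetical, and the resulting device count is not a figure from the benchmark.

```python
import math


def min_devices(peak_iops: float, service_time_ms: float) -> int:
    """Lower bound on the number of logical devices needed to sustain
    peak_iops when each device handles one I/O at a time."""
    if peak_iops < 0 or service_time_ms <= 0:
        raise ValueError("peak_iops must be >= 0 and service_time_ms > 0")
    # One device completes at most 1000 / service_time_ms I/Os per second.
    return math.ceil(peak_iops * service_time_ms / 1000.0)


if __name__ == "__main__":
    # 27,000 I/Os per second at 2 ms per I/O.
    print(min_devices(27_000, 2))
```

In practice the I/O load is never spread perfectly evenly, so a real configuration needs considerably more logical devices than this bound suggests.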
After we reconfigured the DASD, we were able to scale to about 85 K users before we ran out of capacity on the eight CPs. We added four more CPs to bring our VM and Linux guest up to 12 CPs for our next runs. Our virtual guest and Lotus Domino servers received a 50 percent increase in the number of CPs available without our having to rebuild, redistribute, or make any Lotus Domino changes for the additional capacity.
With this additional capacity, we were able to achieve a steady state of 17 K users in each of the six DPARs, for a total steady state of 102 K users in one Linux guest running under VM. Figure 7 shows the active 15-minute and connected user counts for one of these runs.
Figure 7. User count during the 102 K run
During the steady state for the run we averaged about 1.5 million Lotus Domino transactions every 10 minutes during the test or about 9 million Lotus Domino transactions per hour as shown in figure 8.
Figure 8. Lotus Domino transaction rates
During these runs, the total I/O workload into this single Linux guest peaked at just under 27 K I/Os per second. We were able to average a response time of less than 2 ms in the DS8000s, VM, and Linux kernel. During these runs, the other workloads previously mentioned were active on this same server so that the total number of I/O per second into this server was much higher than just the 27 K for the Lotus Domino workload.
During the various runs with different numbers of users, the processor cost per user was stable until we reached the 80 K user mark. At this point, the cost per user started to grow as we added the last 20 K users. We did not see this growth in the single DPAR test going up to 17 K users, nor did we see it in the 51 K user test, which is documented in the next section of this article, where we ran multiple Lotus Domino servers with 17 K users. This growth in processor cost per user seemed to be related to overhead in the Linux kernel itself.
For this benchmark and other measurements, we saw about a 10 percent overhead for the first guest under VM. Additionally, there is about a 1 percent to 2 percent overhead for each additional VM guest (not the Lotus Domino partition server) thereafter. This 10 percent overhead divided over the six Lotus Domino server instances means that for this configuration we ran about a 2 percent overhead for each Lotus Domino server in this virtual environment for this benchmark. Note, though, that not all virtualization technologies and platforms can achieve this level of scalability, throughput, and low overhead in a single physical footprint. Therefore, these numbers must not be used for different Lotus Domino virtual implementations from the one detailed in this article.
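The per-server overhead figure follows from dividing the guest overhead across the Lotus Domino servers that share it. The Python sketch below shows that arithmetic; treating the overhead of additional guests as simply additive is an assumption made here for illustration, and the function name overhead_per_server is hypothetical.

```python
def overhead_per_server(first_guest_pct: float,
                        additional_guests: int,
                        per_additional_guest_pct: float,
                        domino_servers: int) -> float:
    """Virtualization overhead, in percent, attributed to each Lotus
    Domino server when the total is spread evenly across all servers."""
    if domino_servers <= 0:
        raise ValueError("domino_servers must be positive")
    total = first_guest_pct + additional_guests * per_additional_guest_pct
    return total / domino_servers


if __name__ == "__main__":
    # This benchmark: one guest (about 10 percent), six Domino servers.
    print(f"{overhead_per_server(10.0, 0, 0.0, 6):.2f} percent per server")
```

With one guest and six servers the result is a little under 1.7 percent, which the article rounds to about 2 percent.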
Vertical and horizontal scalability
When you migrate to a virtual environment for your Lotus Domino infrastructure it is a good time to look at your existing infrastructure. The quickest and easiest way is to move your existing Lotus Domino servers into your new environment. That approach typically does not allow you to fully exploit or best utilize your new virtual environment. One of the measurements we performed was to assess the difference in running a given workload over two different configurations. To measure this difference, we ran identical workloads on the same configurations with one difference: the number of Lotus Domino partition servers.
In our first test case, we ran 51 K benchmark users across three DPARs and determined a cost per user. We then ran the same 51 K users across six DPARs in the same virtual configuration and determined that cost per user. Figure 9 shows the results of these tests.
Figure 9. Comparison of processor costs per user
As we can see, the processor cost was 20 percent lower per user by consolidating the workload to fewer Lotus Domino partition servers. We have also seen this type of reduction in customers’ and IBM production environments when they are able to perform server consolidation. By scaling your Lotus Domino infrastructure vertically (refer to figure 1 for an example), you not only save on the resources that it takes to run that environment, but you also save on the administration cost because you now have fewer servers to manage, upgrade, and support.
The key to fully exploiting virtual benefits is to have a physical/virtual implementation that can allow you to add resources to your environment so that you can grow vertically first without having to rebuild the environment horizontally.
Not just benchmarks, but here today!
"All of this sounds good, but how does this actually work in production?" is the question you are probably asking. IBM has already worked with a customer to migrate 14 K production users into two Linux kernels. Each Linux guest under VM supports about 8 TB of DASD. There are approximately 7 K production users in each kernel, with each guest running multiple DPARS. In addition to their mail servers, they also have name servers and an administrative server inside these two Linux guests for a total of four DPARS in each Linux guest. This configuration was made with Lotus Domino 7 before the advent of the 64-bit version 8.5 Lotus Domino code that allows greater server consolidation today. This customer saw a reduction in the number of Lotus Domino instances that they managed and in their infrastructure setup.
Not only are customers implementing Lotus Domino on Linux in large-scale environments but so is IBM. IBM is in the process of moving its production workloads to Lotus Domino on Linux for System z. The paper "Consolidation of Lotus Domino and Lotus Notes to Linux on System z" is a summary of the activity that has already occurred during this migration. This paper includes a description of the move of 103 hardware systems with 35,100 Lotus Domino applications to Lotus Domino on Linux for System z. Look for additional papers as this work extends to the mail environment next.
Thinking outside the box with virtualization
As we discussed, one of the advantages with virtualization is the ability to dynamically extend your infrastructure to support your Lotus Domino environment. This advantage, coupled with the ability of the enterprise hardware to dynamically change and grow its configuration on the fly, can lead you to new and different ways to run your Lotus Domino environment.
Today, most customers that use clustering run an active/active configuration. This configuration implies that the workload is balanced across both sides of the cluster with the resources being used equally. The other way to run your Lotus Domino cluster is to have it as an active/passive configuration, one in which all the workload is on the active side of the cluster and only the workload needed to maintain the synchronization is on the passive side. The major disadvantage of this configuration is that one server looks virtually empty because it is there "just in case."
The downside to the active/active configuration is that the total workload is the sum of the two halves. In a failure, all the workload from one side shifts to the other side. While most configurations are accurately sized in the beginning, over time both sides of the cluster grow. Do you know if the entire workload can still fit on one side of the cluster? Have you actually disabled half of your cluster during peak prime shift to validate that the workload still fits? Also, not all bottlenecks are processor bottlenecks; do you have the I/O and networking bandwidth to support the peak failover workloads? If you do not actively test and validate your clustering during peak loads, during an actual failure the failover side can become constrained and deliver poor response times or even fail.
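The failover questions above amount to checking, for every resource and not only the processor, whether the combined peak load of both sides fits within one side's capacity. The Python sketch below illustrates that check. The resource names and all the numbers are hypothetical, and a paper calculation like this complements, but does not replace, an actual failover test at peak load.

```python
from typing import Dict, List


def constrained_resources(side_a: Dict[str, float],
                          side_b: Dict[str, float],
                          capacity: Dict[str, float]) -> List[str]:
    """Return the resources for which the combined peak load of both
    cluster sides exceeds the capacity of a single surviving side."""
    return [name for name, limit in capacity.items()
            if side_a.get(name, 0) + side_b.get(name, 0) > limit]


if __name__ == "__main__":
    # Hypothetical peak loads per side and the capacity of one side.
    side_a = {"cp_seconds": 2400, "iops": 9000, "network_mbps": 300}
    side_b = {"cp_seconds": 2100, "iops": 8000, "network_mbps": 350}
    capacity = {"cp_seconds": 4800, "iops": 15000, "network_mbps": 1000}
    print(constrained_resources(side_a, side_b, capacity))
```

In this made-up case the processor and network have room for the failover load, but the I/O subsystem does not, which is the kind of non-processor bottleneck the paragraph above warns about.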
In the virtual world, you can run the active/passive Lotus Domino cluster and know what resources it takes to run all your workload on one side because it is already on one side of your cluster. You have several options with the passive side in the virtual world. You can reuse the resources on the passive side for lower priority workloads (test, development, and other non-Lotus Domino workloads) that you are willing to temporarily sacrifice during a failure. You can look at a capacity-on-demand solution. This kind of solution allows you to plan for the failover of some percentage of your population (individual server failures), with the remaining additional capacity dynamically turned on in a full failure. The advantage of a capacity-on-demand solution is that you don’t have to buy 100 percent of the passive hardware until you use it. The choice of whether to reuse or not pay until you need the resources is now yours, but you no longer need to have the spare capacity there, waiting for the "just in case" situation to occur.
Not all virtualization technologies provide the same benefits or have the same cost overheads. Some virtualization technologies (such as the one we ran) have been around for more than 40 years with many improvements and enhancements. IBM has been running Lotus Domino in a virtual environment since Lotus Domino 4.5.1. IBM has consolidated most of its Lotus Domino application servers to Linux for System z under VM. IBM has also announced plans to move its mail servers to this environment.
While the virtualization environment in System z that we used in this article's benchmark had less than a cumulative 2 percent overhead for each Lotus Domino server instance, that benchmark is not typical of most virtualization technologies. Understanding the cost of your virtual environment is critical to building a successful virtual deployment. Also, the low cost associated with the ability to virtualize massive amounts of I/Os in these examples (benchmark and production) can allow for significant vertical scalability and cost savings.
Virtualization of your Lotus Domino environment can significantly help you to streamline and reduce your total cost of ownership (TCO). Even though moving existing servers in place can provide savings, the true savings come from being able to consolidate your servers (reduce your environmental complexity) to scale vertically. Also, substantial administration and management cost savings can be realized by streamlining your environment and reducing the complexity (with fewer Lotus Domino servers to manage, upgrade, and support) of your Lotus Domino deployment. Finally, having your infrastructure fit Lotus Domino instead of having Lotus Domino fit into your infrastructure can lower your TCO and provide you with the ability to react more quickly and with more flexibility to the needs of your business.
- Read the IBM Red Paper, "IBM Systems Virtualization: Servers, Storage, and Software" (REDP-4396).
- Read the IBM paper, "Consolidation of Lotus Domino and Lotus Notes to Linux on System z."
- Learn more about Queued Direct I/O (QDIO).