Sizing your IBM Lotus Domino mail servers
You have just finished sizing the requirements for your IBM Lotus Domino partition servers (DPARs) on the new hardware to which you will install, consolidate, or migrate. If you fast-forward a bit into the future, however, you notice that your DPARs are taking more CPU resources than your sizing projected. How are you going to fit your remaining users into your new DPARs? Why were they undersized to begin with?
In this article, you learn why some Domino sizings tend to be inaccurate after the first couple of months and how this affects your sizing and, in turn, your DPARs' CPU resource requirements. You also learn how to monitor and measure your DPARs to ensure they are using the CPU resources appropriate for your sizing and workloads.
This article does not provide a formula for sizing because each of the various Domino platforms has different tools and processes to do this. Instead, you learn how to greatly improve the validity of your sizing information so that these tools and processes can provide a much more accurate sizing. Specifically, you:
- Look at how you might be sizing today
- Review factors that affect your sizing
- Learn how to collect data from existing DPARs to facilitate sizing your new DPAR's CPU requirements
- Review a sizing case study and example
This article assumes that you are an experienced Domino administrator and that you are familiar with the various features and functions of Lotus Domino. This article addresses Domino DPARs running release 6 or later on any platform.
"How did I end up here?"
This is the question many administrators ask themselves after they size their DPAR requirements, either immediately after implementation or a couple of months later when they encounter CPU resource issues. In cases that involve server consolidation and/or platform change, there is usually an assumption that Lotus Domino on the new platform is not performing as well as it did on the old platform. Another tendency is to assume that Lotus Domino is not scaling as efficiently because there are fewer DPARs running with higher user rates than before.
As you peel back the layers of how you arrived here, a key factor to understanding what happened is the data with which you have to work. While customers typically gather CPU resource utilization numbers, such as CPU busy, they typically do not collect and manage their Domino statistics. DPARs have a dynamic workload that can change not only day to day, but also minute to minute. At any given instant, you do not know which workload is being directed at your DPARs by your end users.
Customers tend to use the Domino transaction count (server.trans.total) as a way of measuring the workload going through their DPARs. However, the Domino transaction count (as given to you by a Domino server) is not an atomic measurement of workload going through your servers. Rather, it is a measurement of the NRPC traffic between your DPARs and either Lotus Notes clients or other Domino DPARs. Upgrading your Notes/Domino release (either on the server side or client side) can change your transaction counts while you are running the same workloads.
Also, changes in other Domino tasks that are running (replication, AdminP, indexing, and so on) impact your transaction counts, even if your users are performing the same actions. Finally, non-NRPC protocol clients, such as HTTP, IMAP, and POP3, can have a profound effect on resource utilization while showing little change in your transaction counts.
It is important that you not only know how many users (registered, connected, and active 15-minutes) you have on your DPARs, but also that you understand which workloads they are asking your DPARs to perform. For sizing purposes, IBM capacity planners generally group mail users into four different categories based on workload characteristics: light, medium, heavy, and power users. Because each category has a different CPU usage profile, you can easily miscalculate the amount of resources needed to support your user community, if you do not understand which category your users are in or what mix of categories you have.
Factors that affect your sizing
There are many different factors that can affect the CPU sizing of your DPARs, several of which are discussed below.
Allowing user-written agents to execute on your DPARs is like your users having a blank check for any amount of CPU resources on your servers. Because you have no idea what these agents can do, how often they intend to run, or how many users run them, there is no way to project which resources they consume. Agents have a tendency to be shared by many users across an organization and to spread quickly, especially if they have a function that is seen as useful; they can go from one person to two to four to eight to 16 and so on very quickly.
Even the best-intentioned agents written by a Notes administration group may not perform as intended. For example, internally at IBM while looking at resource utilization on one of our mail DPARs, we found that the polling agent that validates whether or not the server is available was consuming more resources on the server than the server was using to route mail. After the agent was rewritten, its CPU use was dramatically reduced.
A poorly written agent can wreak havoc with your DPARs. In another example, we found an agent that was scheduled to run hourly. Due to the way it was written, however, it was sequentially reading through every document in every view, instead of targeting only its intended documents. As a result, each run took more than an hour, and because the agent was scheduled to run hourly, it effectively never stopped running, 24 hours a day, seven days a week. This single agent in the one DPAR alone was consuming almost one engine's worth of CPU cycles on this four-way CPU box, or almost 25 percent of the total CPU capacity.
If you allow unrestricted agents on your servers and do not test them prior to allowing them on a production DPAR, then be prepared for large spikes in CPU utilization as these agents come and go from your user population. If you must allow personal agents to run on your server, it is recommended that you at least have a review process to understand which agents are there, in what quantities, across which users, and that you measure your DPARs to account for them.
The average size of your mail messages impacts how many resources your DPARs need to support your user population. The larger the average mail size, the more resources it takes your DPARs to process tasks. The Router task uses more resources to deliver these messages to other servers, and the Server/HTTP tasks use more resources to deliver these messages to your clients. A user population that sends large mail messages or messages with large attachments is more costly to support than the same population sending smaller messages. Likewise, the more messages that a user community sends, the more resources are needed to support it.
Inbox and database size
The size of your users' mail files also affects the amount of resources needed to support the user community. The developerWorks article, "Best practices for large Lotus Notes mail files," discusses the impact of your mail file size, the number of documents in the Inbox view, and the resources required to support your user community. The article clearly shows that both the size of your database and the number of entries managed in the Inbox are critical factors in determining the amount of resources needed to support your users and that having more documents in an Inbox is more costly than having a large mail file.
Clustering and replication
If you set up clustering and/or replication between your DPARs, this affects the amount of resources needed to support your users. While clustering has a somewhat predictable overhead (based on mail volume because it's an event-driven activity), replication can have an unpredictable impact on the requirements to support your users. With clustering, only those databases that have a change are pushed and replicated. If you use scheduled replication, however, all databases targeted for replication are replicated, regardless of whether or not any change was made to them (including opening the databases and searching for any differences). Also, the smaller your replication interval, the greater the resource impact on your DPARs because CPU cycles are used more often to find that there is nothing to be replicated.
Scheduled replication is nevertheless recommended, even if you are using clustering. This allows the DPARs to occasionally synchronize the databases on each side of the cluster and to ensure that they are identical. You can also use a Program document to kick off a replication at DPAR startup, which provides you with an event-driven replication in case of a DPAR restart rather than scheduling one every hour or so just in case. The trick is to replicate only a few times a day with clustering.
The receiving side of clustering and replication is the Server task, so most of the cycles/cost of these functions are included in that task rather than the Cluster/Replication tasks. Failure to account for clustering/replication cycles in your Server task can lead to undersizing your requirements for these features.
If you allow full-text indexes to be used on your DPARs, then you must not only account for the management of these indexes, but also the searches performed on these indexes. It is not the managing of the full-text indexes that takes a lot of resources, but the actual searches. You need to know how many searches are being directed at the indexes.
If you're using transaction logging, you must account for the extra cycles needed to run it. As with any feature that provides you with additional benefits (improved data integrity, faster server restarts, and greater DPAR scalability), you must plan for its CPU resource requirements.
You must also look at any other Domino feature, add-in function, or third-party product that can impact resource requirements. For example, some things to consider are virus scanning, backup and recovery, RIM/Blackberry usage, single-copy template, message tracking, network compression, and customer-written applications that interface with mail. All these activities in varying degrees impact the amount of CPU needed to support your user community.
How administration is managed on your servers
While the above list tends to include discrete features, functions, or workload characteristics, the way in which you configure and manage your DPARs can also have an impact on your CPU utilization and on your DPAR sizing requirements. For example, if you allow AdminP or Domino Directory replication to be executed on your DPARs during prime shift, you must plan for the cycles required to support these features, in addition to your user workload. Although replication of your Domino Directory may never be eliminated during prime shift, you can control the frequency at which it occurs. Obviously, the more often you push Domino Directory changes across to your DPARs, the greater the resource impact.
Running AdminP during prime shift can be somewhat like running agents in that you never know exactly how many resources your AdminP updates require. For example, if you change a name or a group and you enabled field-level security in the databases, then every document in each database must be searched to determine if that name exists in that field, which can be a very CPU-intensive process to perform. At the same time, other AdminP changes can be very inexpensive to run and go rather quickly. The key here is that you never really know which resources the next AdminP request requires. As a rule of thumb, it is recommended that you run AdminP outside your prime-shift user window to eliminate the spikes in CPU usage from your DPARs. Otherwise, you must plan to allocate additional resources for them.
In addition, exercise care when you run Compact, Updall, Fixup, or other maintenance tasks during prime shift because there can be a large impact on your resources.
Types of clients
The type of client your users run has an impact on the CPU requirements to support that user community, but because the cost of a client changes over the releases of Lotus Domino, it is impossible to provide an exact client cost value here.
For example, let's say that an NRPC client takes 100 CPU units in the current release and client X takes 200 units, yielding a 2:1 ratio. If in the next release, you see a 30 percent improvement in the NRPC client (to 70 units) and 15 percent in client X (to 170 units), then the ratio becomes approximately 2.43:1. This does not mean that client X is worse in the new release (it actually is 15 percent better). However, when compared with the cost of the NRPC client on the new release, client X takes proportionally more resources than before. If we reverse the percent improvements so that now NRPC is 15 percent better (85 units) and client X is 30 percent better (140 units), then the new ratio is approximately 1.65:1.
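The ratio arithmetic in this example can be reproduced with a few lines of Python. This is only a sketch of the calculation; the function name `cost_ratio` and the idea of expressing an improvement as a fraction are illustrative choices, not part of any Domino tool.

```python
def cost_ratio(nrpc_cost, other_cost, nrpc_gain, other_gain):
    """Return the other client's CPU cost relative to NRPC after both improve.

    Gains are fractions: 0.30 means the client needs 30 percent less CPU.
    """
    new_nrpc = nrpc_cost * (1 - nrpc_gain)
    new_other = other_cost * (1 - other_gain)
    return new_other / new_nrpc


# Baseline from the text: NRPC = 100 units, client X = 200 units.
print(round(cost_ratio(100, 200, 0.00, 0.00), 2))  # 200 / 100
print(round(cost_ratio(100, 200, 0.30, 0.15), 2))  # 170 / 70
print(round(cost_ratio(100, 200, 0.15, 0.30), 2))  # 140 / 85
```

Because both clients improve in each case, a rising ratio only says that client X improved less than NRPC did, not that it became more expensive in absolute terms.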
In addition, not only can the cost of each client with respect to an NRPC client vary, but each platform may also show differences. For example, in release 7, Linux on Intel showed a dramatic improvement for NRPC clients over release 6.x, while Linux for System Z did not. This was due to the fact that the major changes for scaling in Linux were first developed and placed into Linux for System Z in release 6.5, so these dramatic changes were not reflected in their release 7 numbers. This occurs across all the platforms: as an enhancement is developed, it is ported back into core code and other platforms, where feasible. Thus fluctuations among improvements in the various clients supported across Lotus Domino's various platforms must be accounted for.
HTTP users can expect to take significantly more CPU cycles on the DPAR than an NRPC client. This can range up to six times more than an NRPC user, depending upon your use of Lotus Domino Web Access or Web mail because processing that had been performed by the Notes client is now pushed back on the server to perform. With a Notes client, the server can transmit the data and let the Notes client format and present the results. However, with an HTTP client, the server must do most or all of the data formatting/presentation as well.
POP3 users take approximately 20 percent less CPU than an NRPC client. By default, POP3 is a client-centric protocol. Mail is pulled down from the server to the client, deleted from the server, and then managed and navigated on the client. However, you can configure POP3 to leave the mail on the server and run it as a server-centric-type protocol as well, thus changing the cost of running POP3 clients against your DPARs.
IMAP users may take up to approximately 60 percent more CPU than an NRPC client. By default, IMAP is a server-centric protocol. Mail is left on the server, and the client must continuously communicate with the server as users navigate through their mail. Although some IMAP clients can be configured to pull mail down from the server and run it on the client, you must understand which features/functions you enable and their potential impact.
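One way to apply these relative costs is to convert a mixed client population into "NRPC-equivalent" users before feeding it into a sizing tool. The sketch below assumes the figures quoted above as weights (HTTP at its upper bound of 6x, POP3 at 0.8x, IMAP at its upper bound of 1.6x); the dictionary, the function name, and the sample user counts are hypothetical, and the weights themselves vary by release, platform, and configuration.

```python
# Relative CPU cost per active user, with an NRPC client defined as 1.0.
# HTTP and IMAP use the upper bounds quoted in the text (worst case).
RELATIVE_COST = {
    "nrpc": 1.0,
    "http": 6.0,   # "up to six times more"
    "pop3": 0.8,   # "approximately 20 percent less"
    "imap": 1.6,   # "up to approximately 60 percent more"
}


def nrpc_equivalent_users(active_users_by_client):
    """Weight each client type's active users by its relative CPU cost."""
    return sum(RELATIVE_COST[client] * count
               for client, count in active_users_by_client.items())


# Hypothetical mix of active users on one DPAR.
mix = {"nrpc": 1000, "http": 100, "pop3": 50, "imap": 50}
print(nrpc_equivalent_users(mix))  # 1000 + 600 + 40 + 80
```

In this made-up mix, the 200 non-NRPC users account for 720 of the NRPC-equivalent users, which is why the client mix matters as much as the head count.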
What your users do
Simply knowing the number of registered users on a DPAR does not give you an accurate indication of the amount of resources that it uses. It is possible that two DPARs can have very different resource utilizations, even though they have the same number of registered users, the same number of active users, and the same types of clients. Registered users can affect backup and polling activities, but your active users affect your CPU utilization, and your connected users affect memory utilization.
As mentioned earlier, users can be classified into four different categories: light, medium, heavy, and power. As you would expect, the amount of CPU to support a thousand light users is much less than the amount needed to support a thousand heavy users. Each user type is defined below with respect to sizing, as used by the Lotus Domino for zSeries team, along with a summary in Table 1.
Light users use email only without the scheduling feature of calendar. They may occasionally receive appointment notices, but never schedule meetings themselves. They never send/receive email attachments and have message sizes of 10 KB or less. They send or receive no more than 10 messages/day (including Internet mail), evenly distributed throughout the day. Mail files for light users are under 50 MB. Users who start out as light quickly learn more about the product and advance to become medium users, so for correct sizing, you should not overestimate the number of light users.
Medium users use email together with light calendar and scheduling (C&S) functionality (one or fewer appointments/day). They occasionally send/receive mail with small attachments or graphics. Their average message size is 25 KB with most messages under 100 KB, and they send/receive 10 to 25 messages/day. Their mail file sizes range from 50 to 200 MB. Most users fall into this category. If you're not certain of your users' work habits, choose this category of user activity.
Heavy users exploit more email and C&S functionality (five or more appointments/week) than medium users. Their message sizes are larger (50 KB on average), and most mail messages are under 500 KB. Heavy users send/receive 26 to 40 messages/day, and their mail file sizes are greater than 200 MB.
Power users use Lotus Notes/Domino for core job functions. A power user may be an administrative assistant, who manages several calendars and often uses free-time search to schedule meetings, or a technical expert who is an intensive user of mail, C&S, and other features of Lotus Domino, such as full-text indexing/searching, mail agents, and complex rules. The average message size for a power user is 75 KB, but the typical message size is 100 KB or more. Power users have 10 or more appointments/week, send/receive more than 40 messages/day, and their mail file sizes are greater than 200 MB. Generally speaking, relatively few users in your community should fall into this category.
Table 1. Summary of user-category definitions
|Lotus Notes and Lotus Domino Web Access users|Light|Medium|Heavy|Power|
|---|---|---|---|---|
|Messages per day|Less than or equal to 10|10 to 25|26 to 40|> 40|
|Average message size (KB)|Less than or equal to 10|25|50|> 75|
|Most messages less than|N/A|100 KB|500 KB|N/A|
|Most messages bigger than|N/A|N/A|N/A|100 KB|
|Attachments|No|Yes, but small|Yes|Yes|
|Calendaring functions|Appointment notices|Yes, but light|Yes|Yes|
|Scheduling functions|No|Yes, but light|Yes|Yes|
|Appointments per week|One|Five or fewer|Five or more|10 or more|
|Full-text indexing, searching|No|No|Maybe|Yes|
|Mail agents, complex rules|No|No|No|Yes|
|Mail file size (MB)|< 50|50-200|> 200|> 200|
In addition to these classifications, you also need to understand which workloads -- including the use of local mail replication and directory catalog -- can be offloaded to the Notes client. By using a local mail file with the Notes client, much of the traffic between the client and DPAR is eliminated because the client navigates through the local mail file for most client activities. By using a directory catalog with the client, you can also offload the Domino Directory lookups that occur when mail is addressed. Instead of type-aheads going over the network to your DPARs, you can have them directed to the directory catalog available locally on the Notes client.
By using these features, you can offload activity that is normally executed on the server to the clients, and thus reduce your DPARs' CPU resource cost. However, setting the local mail replication interval to a low level can have the opposite effect because the Notes client polls your DPARs for new mail every few minutes and drives up CPU usage. You should not set the client replication interval to less than 15 minutes. If you need a replication interval less than this, you should consider not using a client replica and use the user's mail file on the DPAR for a lower CPU cost.
It may not be simple to map your users into the above-outlined four categories; instead, it may be easier for you to classify your users by job description (factory, clerk, IT, student, teacher, teller, stock broker, and so on). Then you can map each job description to one of the user categories, giving you a general understanding of user distribution and anticipated workloads. You can refine the definitions as your understanding grows.
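The job-description approach can be sketched as a simple lookup followed by a tally. The mapping below is purely hypothetical and should be replaced with your own assignments; unknown job descriptions fall back to the medium category, following the earlier advice to choose medium when you are not certain of your users' work habits.

```python
from collections import Counter

# Hypothetical mapping of job descriptions to user categories.
JOB_TO_CATEGORY = {
    "factory": "light",
    "clerk": "medium",
    "teller": "medium",
    "teacher": "medium",
    "IT": "heavy",
    "stock broker": "power",
}


def category_mix(headcount_by_job, default="medium"):
    """Return the fraction of users in each category."""
    counts = Counter()
    for job, headcount in headcount_by_job.items():
        counts[JOB_TO_CATEGORY.get(job, default)] += headcount
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


# Made-up head counts; "student" is not in the mapping, so it counts as medium.
staff = {"factory": 400, "clerk": 300, "IT": 200, "stock broker": 50, "student": 50}
print(category_mix(staff))
```

The resulting fractions give a first estimate of user distribution that you can refine as monitoring data shows how each group actually behaves.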
If you do not take into account the growth pattern of your user community, whatever sizing you perform today may soon be insufficient as your users grow in their complexity or your mail/mail file volumes grow.
In one case, a customer had a 5 percent monthly compounded growth of their DPARs' CPU utilization, translating to an almost 80 percent growth in server requirements year-over-year. There were no quotas or limits on the size of user mailboxes, no limits on personal agents, or limits on full-text indexing or searching. Furthermore, there was no archival or management strategy for old mail, so users kept everything in their Inbox.
While your DPARs may not be as extreme as this example, failure to include growth into your sizing impacts how long your projected hardware can support your environment.
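The compounding in this example is easy to check, and the same arithmetic tells you roughly how long a given amount of headroom lasts. The sketch below is illustrative only: the function names are made up, and the 50 percent starting utilization and 85 percent planning ceiling are assumed values, not recommendations.

```python
import math


def annual_growth(monthly_rate):
    """Compound a monthly growth rate over 12 months."""
    return (1 + monthly_rate) ** 12 - 1


def months_until_full(current_util, ceiling, monthly_rate):
    """Whole months until compounded utilization first reaches the ceiling."""
    if current_util >= ceiling:
        return 0
    return math.ceil(math.log(ceiling / current_util) / math.log(1 + monthly_rate))


print(f"{annual_growth(0.05):.1%}")          # 5 percent per month, compounded
print(months_until_full(0.50, 0.85, 0.05))   # assumed 50% today, 85% ceiling
```

At 5 percent per month, a DPAR running at half of its capacity today reaches an 85 percent ceiling in under a year, which is why growth belongs in the initial sizing rather than in a later correction.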
Peak load versus average load
As you look at sizing your DPARs, you must ensure that you size for the peak workload that the DPARs must handle and not the average workload. For example, the first day back after a holiday weekend, a DPAR restart in the middle of prime shift, or a mass mailing can cause a much higher CPU peak than would normally occur. If you do not allow adequate resources to handle these situations, then the DPARs can get into a CPU-stressed situation and panic, deliver poor user response time, or fail over to other DPARs in your cluster.
Be sure you understand what your statistics are really telling you. Are you looking at the full 24-hour interval, or are you looking at prime shift interval? By merely changing the sample view of your data, you can significantly change the values in your reports. For example, a server may run an average CPU utilization of 60 percent for a full day. However, looking at the same data, but filtering it for peak loads during prime shift, you may see an average of 85 percent utilization with individual CPU spikes approaching 95 to 100 percent. Although the 60-percent number is valid for the full 24-hour period, it is not an accurate representation of the actual peak workloads. If you use that number for planning a new server, you quickly run out of capacity on that box when the 95-to-100-percent spikes occur.
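The effect of the sample window can be demonstrated with a handful of numbers. The following sketch uses invented hourly CPU samples and an assumed prime-shift window of 8 AM to 4 PM; the function name `summarize` is illustrative.

```python
from statistics import mean


def summarize(samples, shift_start=8, shift_end=16):
    """Compare full-day and prime-shift views of (hour_of_day, cpu_percent) samples."""
    all_day = [cpu for _, cpu in samples]
    prime = [cpu for hour, cpu in samples if shift_start <= hour < shift_end]
    return {
        "full_day_avg": mean(all_day),
        "prime_shift_avg": mean(prime),
        "prime_shift_peak": max(prime),
    }


# Invented data: 40 percent off shift, 85 percent in prime shift, one spike at 10 AM.
samples = [(hour, 85 if 8 <= hour < 16 else 40) for hour in range(24)]
samples[10] = (10, 98)

for name, value in summarize(samples).items():
    print(f"{name}: {value:.1f}")
```

The same 24 samples produce a full-day average in the mid-50s and a prime-shift average in the mid-80s, so the window you report on should always be stated alongside the number.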
You must also apply the same understanding when looking at your Domino statistics, which you receive either from a show stat command or by looking at the Statrep.nsf database. Many of these statistics are cumulative in nature, meaning that the value is always increasing from when the server was first started or when the value was last reset. If you do not collect this data regularly and instead look at it only from time to time, you may miss not only what is happening on your DPARs, but also when and how often it happens.
When your data was captured
All servers go through some sort of cycle in their workload activity. During the summer months, for example, typically more people take vacation time than during the rest of the year. If you size your server based on your DPARs during the month of August, you may be underestimating your requirements.
Certain business cycles may also drive a higher or lower level of activity on your DPARs. For example, in the weeks before a new release ships or before quarterly/yearly business reports are due, you can expect to see a higher activity rate in your DPARs than in the time following a new release. Because you want your DPARs to support peak loads, you need to understand what your business cycles are and what time frames best represent these peak loads.
Application versus mail workloads
Most sizing and benchmark numbers from IBM revolve around mail workloads because they are repeatable, allowing us to be consistent in our test methodology. Lotus Domino can run many customer-written application workloads besides mail, but because these workloads use the various Domino features in unique combinations, it is almost impossible to predict a sizing for them. In this case, a benchmark is generally required to produce an accurate sizing for your custom application. You may also run a pilot of your application and then base your sizing on the actual utilization of this pilot. However, there may be an inherent limitation in your application, in Lotus Domino, or in a server that limits the vertical growth of a single Domino image and that requires you to run multiple copies.
For example, a new release of Lotus Domino may have a 25 percent performance improvement in CPU resources. This may allow you to consolidate two DPARs into one on a new piece of hardware that you plan to buy. However, if the application is inefficiently written or Lotus Domino is incorrectly configured, such that some sort of semaphore locking occurs (or other limiting factor), then you may never reach the full CPU utilization within a single DPAR before a bottleneck occurs.
Monitoring your servers
You may quickly become overwhelmed by the amount of information to consider when sizing your servers, but it's important to remember that Lotus Domino can provide you with much of this data. At a minimum, you should consistently collect monitoring data on your DPARs to understand their growth patterns and to determine the impact new features and functions have on them. This can help you understand your current usage and why you may be using more resources than you projected.
Information about the items listed in the "Factors that affect your sizing" section earlier can be found in your Statrep.nsf database, if you have enabled statistic collections. By default, only Events are stored in this database; your DPAR statistical information is not stored there unless you enabled the Collect task. You can also set up a single DPAR to collect the statistical information from multiple DPARs, so you can have a single consolidated Statrep.nsf database instead of many. You do not need to enable the Collect task across all your DPARs, only on those that collect the information for you.
Figure 1 shows a sample Statrep.nsf database with statistical data in Lotus Domino 7.
Figure 1. Sample Statrep.nsf database view
In figure 2, one of the records is opened to show a sample of what one area of that record looks like.
Figure 2. Sample statistics record
You can export this statistical information from Lotus Domino to a flat file that you can load into either a database or spreadsheet. Figure 3 shows a sample of an exported file.
Figure 3. Sample exported file
Analyzing data from Statrep.nsf
Let's now discuss several charts based on data obtained from a Statrep.nsf database from an internal production DPAR at IBM. These charts present the data in several different views as follows:
- Interval: All values plotted on a 7x24-hour basis
- Shift interval: All values plotted that fit within prime shift, which is defined as Monday through Friday 8 AM to 4 PM
- Daily: One point for each day. Each day's point represents the full 24-hour view of that day; however, depending upon the data type, the data may be summed, maximized, minimized, or averaged.
- Shift daily: One point for each day. Each day's point represents the prime-shift view of that day. Again, depending upon the data type, the data may be summed, maximized, minimized, or averaged.
NOTE: Some of the charts may also use two Y axes to show the data, allowing you to see the relationships between multiple items on the charts that have very different values. The legend in each chart clearly shows which Y axis is to be used.
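The four views differ only in which samples are kept and how they are collapsed. The sketch below shows one way to build the daily and shift daily views from timestamped samples, using the prime-shift definition given above (Monday through Friday, 8 AM to 4 PM). The function names and sample values are hypothetical, and the choice of aggregate (sum, max, min, or mean) is left to the caller because it depends on the data type.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def is_prime_shift(ts):
    """Monday through Friday, from 8 AM up to (not including) 4 PM."""
    return ts.weekday() < 5 and 8 <= ts.hour < 16


def daily_view(samples, aggregate, prime_only=False):
    """Collapse (timestamp, value) samples into one point per calendar day."""
    by_day = defaultdict(list)
    for ts, value in samples:
        if prime_only and not is_prime_shift(ts):
            continue
        by_day[ts.date()].append(value)
    return {day: aggregate(values) for day, values in sorted(by_day.items())}


# Invented CPU-percent samples.
samples = [
    (datetime(2024, 1, 8, 3, 0), 20),    # Monday, overnight
    (datetime(2024, 1, 8, 10, 0), 90),   # Monday, prime shift
    (datetime(2024, 1, 8, 14, 0), 70),   # Monday, prime shift
    (datetime(2024, 1, 13, 10, 0), 30),  # Saturday
]

print(daily_view(samples, max))                    # daily view, maximized
print(daily_view(samples, mean, prime_only=True))  # shift daily view, averaged
```

Days with no prime-shift samples, such as the Saturday here, simply drop out of the shift daily view.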
Figure 4 shows the number of NRPC connected users (server.user) and the number of NRPC active 15-minute users (server.users.active15) for this DPAR.
Figure 4. Number of NRPC connected users and number of NRPC active 15-minute users
By determining how many NRPC active 15-minute users there are in your peak periods, you can relate this to the total registered community to determine what percentage of your total user population is active in your DPAR.
It is recommended that you use the NRPC active 15-minute count (server.users.active15) rather than the connected user count (server.user) to obtain your percent active number. The connected user count is affected by the DPAR's server-session-timeout Notes.ini value, meaning that identical workloads can yield two different NRPC connected user counts, depending upon the DPAR's session-timeout value. The active 15-minute user count is not affected by this setting; it is directly related to your users' activity.
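The percent active number is simply the peak active 15-minute count divided by the registered user count. A minimal sketch, with invented sample values and a hypothetical function name:

```python
def percent_active(active15_samples, registered_users):
    """Peak active 15-minute users as a percentage of registered users."""
    return 100.0 * max(active15_samples) / registered_users


# Invented prime-shift samples of the active 15-minute count on one DPAR.
active15 = [310, 420, 455, 390]
print(percent_active(active15, registered_users=1300))  # 455 of 1300
```

Using the peak rather than the average of the samples keeps the result consistent with sizing for peak load.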
Although you can obtain some of the statistics in Statrep.nsf, such as the user counts, directly from each document, other statistics must be derived by comparison of the difference between two intervals.
While the 21 million transactions in figure 2 look like a lot, remember that this is the total since the DPAR was started or since this statistic was explicitly reset. If this server has been up for one or two days, then this is a large number of transactions; however, if the server has been up for a month, then this may not necessarily be a lot. By comparing the various documents in Statrep.nsf, you can build an interval history of how these statistics have been accumulating during the DPAR's operation.
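Turning a cumulative statistic into an interval history amounts to subtracting each reading from the one that follows it. The sketch below assumes the readings have already been exported from Statrep.nsf into time-ordered (label, value) pairs; the sample numbers are invented. Treating a drop in the counter as a restart or reset is also an assumption, and it undercounts any activity between the last reading and the restart.

```python
def interval_deltas(readings):
    """Convert cumulative counter readings into per-interval amounts.

    readings: list of (label, cumulative_value) pairs in time order.
    If the counter goes down, assume the server restarted or the statistic
    was reset, and use the new value as that interval's amount.
    """
    deltas = []
    for (_, previous), (label, current) in zip(readings, readings[1:]):
        amount = current - previous if current >= previous else current
        deltas.append((label, amount))
    return deltas


# Invented hourly readings of a cumulative transaction counter.
readings = [
    ("09:00", 21_000_000),
    ("10:00", 21_150_000),
    ("11:00", 21_420_000),
    ("12:00", 90_000),  # counter dropped: assumed restart
]

for label, amount in interval_deltas(readings):
    print(label, amount)
```

Each output row is the activity for the interval ending at that label, which is the form needed for the interval and daily charts.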
Figure 5 shows an example of an interval view of the number of Domino transactions over a three-week period for this DPAR.
Figure 5. Example interval view of the number of Domino transactions over a 3-week period
By changing this to a daily prime-shift view (see figure 6), you can easily see what your transaction rates were during prime shift over the same period. Also, by allowing this data to accumulate, you can build an accurate view of your DPAR's workloads and resource utilization as well as any changes.
Figure 6. Prime-shift daily view for the same period
As mentioned earlier, understanding your DPAR's workloads is key to an accurate sizing. In addition to users, you can look at any of the data in your Statrep.nsf database to see what the various features of Lotus Domino are doing. For example, figure 7 shows the number of mail messages delivered locally on this DPAR by mail size over a several-week period for prime shift.
Figure 7. Number of mail messages delivered locally by mail size
You can see in figure 7 that there were large spikes in the number of mailings in the 1-to-10 KB (blue line) size range, but what is of more interest are any spikes in the 100 KB-and-up range (black, pink, and turquoise lines). The details of the larger message sizes can be seen on this same chart, plotted on the second Y axis (see graph legend). Although these volumes are much lower than the green line, the size of the message can actually cause a greater spike in CPU resources on this DPAR. If you plot this information on a daily chart (like figure 6), you can see if over time the size and number of messages being delivered locally to this DPAR are changing. If they are, this can be an indication of a growing workload on your DPAR, in which case you need to plan for the additional resources.
Figure 8 shows an example of the amount of clustering activity occurring in one DPAR.
Figure 8. DPAR clustering activity
Just like the mail delivery data in figure 7, this can clearly show any changes in your clustering activity over time. By understanding how your workloads are changing and at what rates, you can better size your server's requirements for your DPARs.
Almost all the items mentioned above in the "Factors that affect your sizing" section of this article can be captured, analyzed, and charted from your Statrep.nsf database in this manner.
In addition to Statrep.nsf, look at your native platform statistical information (not just the Domino platform statistics), which provides additional detailed information beyond Lotus Domino's view about how things are running within your server/DPARs. Figure 9 shows the CPU relationships of the top 12 processes in a DPAR as these tasks relate to the Server task. The Server task was chosen for this example because this DPAR is supporting Notes clients, and all NRPC client requests are handled by the Server task. By relating these tasks to the Server task, you can understand how the CPU resources are being consumed by the various tasks that make up this DPAR.
Figure 9. CPU relationships of top 12 DPAR processes
The key to understanding this figure is that it is a proportional relationship of CPU resources; in other words, if more users were added doing the same workloads, the lines on this chart would not change. If, however, the activity rates on this DPAR change (for example, more agents, index searches, or replications are performed), the lines on this chart would also change, indicating the area you should be looking at for the different workloads.
If this chart had shown the Agent Manager task taking more cycles than the Server task, then you would know to account for this additional CPU resource in your sizing. (High CPU consumption by the Agent Manager, AdminP, Update, Replica, and other Domino tasks has been known to occur.) By understanding what these various tasks are doing, you can better understand which resources are needed to support your user community.
For example, in the lower middle portion of figure 9 you can see that the Compact task (turquoise line) was running during prime shift. While this is a single occurrence in this sample, for sizing purposes, you should determine how often these tasks run during prime shift. This type of chart allows you to quickly profile your DPAR usage and to understand the relationships among the tasks that are running. The large spike in the Router task was a mass mailing, another example of a workload spike for which you must plan.
To produce this chart, the CPU used by each task was divided into CPU used by the Server task for each interval. You can just as easily select the HTTP, IMAP, or POP3 task as your base task, if they are your primary clients. For a hub DPAR, use the Router task as the base task.
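The calculation behind such a chart can be sketched as follows. This Python example assumes that per-interval CPU seconds for each task have been collected with your platform's own tools into dictionaries, and it expresses each task's CPU as a ratio to the base task's CPU. The direction of the ratio, the data layout, and the names are assumptions made for illustration.

```python
def cpu_relative_to_base(intervals, base_task="Server"):
    """For each interval, divide every task's CPU by the base task's CPU.

    `intervals` is a list of {task_name: cpu_seconds} dicts (hypothetical
    layout). An interval in which the base task used no CPU yields None,
    because the ratio is undefined there; keeping the placeholder preserves
    the alignment of intervals for charting.
    """
    ratios = []
    for cpu_by_task in intervals:
        base_cpu = cpu_by_task.get(base_task, 0.0)
        if base_cpu <= 0:
            ratios.append(None)
            continue
        ratios.append(
            {task: cpu / base_cpu for task, cpu in cpu_by_task.items()}
        )
    return ratios
```

With this convention, a value above 1.0 for a task means that it used more CPU than the base task in that interval. Passing `base_task="Router"` would give the hub-DPAR variant.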
Case study: server consolidation sizing
In this case study, let's examine a server consolidation opportunity in which there is an existing set of Domino DPARs on a set of hardware that is running out of CPU capacity. The plan is to consolidate the user community onto a new set of larger hardware servers. The number of DPARs is to be consolidated as well. A pilot was set up in which roughly 10 percent of the existing Notes user community was migrated to the new DPAR on the new hardware to validate the sizing.
After the targeted 10 percent of the user community was moved onto the new DPAR, it was discovered that these users were using almost 50 percent of the total CPU resources for the entire user community. To understand what happened, data was captured from all the existing DPARs and from the newly created DPAR, and then compared.
Table 2 shows the results of the analysis of this data. The second column shows the analysis of the newly consolidated DPAR, and the third column shows the analysis of the original DPARs with the remaining 90 percent of the user population. Because the original DPARs were running an older release of Lotus Domino, the active 15-minute user counts were not available from them.
Table 2. Analysis of server consolidation study
|Description||Newly consolidated DPAR||Average of all old DPARs||New to old||Single old DPAR||New to single old|
|Percentage of total users||9.554||90.446||N/A||31.405||N/A|
|Peak connected to registered users (percentage)||73.767||73.059||1.010||51.228||1.491|
|Peak active to registered users||56.054%||?||?||?||?|
|Transactions per registered user per hour||196.625||137.059||1.435||72.42||2.715|
|Agents executed per registered user per hour||.303||.173||1.751||.069||4.39|
|Calendar appointments per registered user per hour||.529||.147||3.598||.109||4.853|
|Average mail (MB) transferred per registered user per hour||.835||.582||1.435||.390||2.141|
|Average database size (MB) per registered user||597.118||195.263||3.058||87.212||6.847|
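The ratio columns in Table 2 are simple quotients of the per-user metrics. As a minimal sketch, the Python below recomputes them for three of the rows; the dictionary keys and the function name are illustrative, and the values are copied from the table.

```python
# Values copied from Table 2: (new DPAR, average of old DPARs, single old DPAR).
TABLE_2 = {
    "transactions_per_user_hour": (196.625, 137.059, 72.42),
    "agents_per_user_hour": (0.303, 0.173, 0.069),
    "db_size_mb_per_user": (597.118, 195.263, 87.212),
}


def workload_ratios(table):
    """Return {metric: (new/old average, new/single old)} rounded to 3 places."""
    return {
        metric: (round(new / old_avg, 3), round(new / single_old, 3))
        for metric, (new, old_avg, single_old) in table.items()
    }
```

For these three rows, the results agree with columns 4 and 6 of Table 2 to the precision shown there. Running the same calculation on your own per-user metrics gives a quick check on whether a pilot group resembles the rest of the community.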
As you can see, the user activity on the newly consolidated DPAR was much heavier than on the original DPARs. The comparison of the workloads on these two DPARs is shown in column 4. To understand this difference in workloads, let's analyze the users who were migrated to the pilot DPAR.
First, the original DPARs were resource constrained on both CPU and disk. The Notes administrators decided to move the users with the largest databases to the newly consolidated DPAR to relieve their disk issues. What they had unintentionally done, however, was to consolidate corporate-wide every power and heavy user into the newly consolidated DPAR. The resource utilization to support this group of heavy/power users was much more than that to support the rest of the user community.
Another indication of this phenomenon was that the newest/lightest users were placed on one of the original DPARs. The analysis of this single DPAR can be seen in column 5. Comparing this server to our newly consolidated DPAR, you can see that the difference in the results (column 6) was even more dramatic.
If the pilot users had been migrated from this DPAR, there would have been a very different CPU usage curve on the new DPAR, in which case we would have wondered why we had sized for so much CPU when this pilot group of users was using so much less CPU than the amount projected.
What is key here is that either of these groups of users (the heavy/power or the light) alone would have presented an inaccurate projection of the total user community. This is why one DPAR can have much different resource usage than another, even though both have the same number of users on the same clients.
Also, there was latent workload in the old DPARs. Because response times on these DPARs had deteriorated, users used them as little as possible. After the pilot users were moved to the new DPAR and experienced its improved response time, however, they increased the amount of mail they sent. Moreover, as response times on the existing DPARs improved, the remaining users increased their activity as well, so the workload on those DPARs also grew.
In looking at any of your existing servers, you need to determine whether or not they currently have some resource constraint because latent demand on those servers shows itself after the constraint is removed.
Sizing example exercise
In this example exercise, you can see how the workload definition dramatically affects the amount of hardware resources required to support a fixed user population. Although this is a sizing example, IBM sizing tools rely not only on the derived benchmark data, but also on analysis and feedback from both customer and internal production environments.
NOTE: This example is for illustration purposes only. It is not intended to imply that this sizing applies to a specific release of Lotus Domino on a specific platform.
Let's assume there are 10,000 registered users all using Lotus Notes as their client. Twenty percent of the users are active in a 15-minute interval (classified as medium users). Transaction logging and virus scanning are enabled. No other add-in task or third-party product is running in this initial sizing. Given this information, you project that you need 97 percent of a single-engine processor to support this user community. Because this does not leave much room for workload spikes, you are better off allocating a second engine for a two-CPU box with a projected usage of 50 percent.
Let's increase the active rate from 20 to 30 percent of the registered users. This seemingly small change in concurrency actually increases the overall workload requirements by 50 percent, resulting in a projected CPU utilization of 76 percent of a two-way box or 52 percent of a three-way box.
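The 50 percent figure follows directly from the change in concurrency. The short Python sketch below works through that arithmetic; it assumes that workload scales with the number of active users of the same class, and the function names are illustrative.

```python
def active_users(registered, active_percent):
    """Users active in a 15-minute interval, given the active rate in percent."""
    return registered * active_percent // 100


def relative_increase(old_percent, new_percent):
    """Fractional growth in concurrent users when the active rate changes."""
    return (new_percent - old_percent) / old_percent
```

For example, `active_users(10_000, 20)` returns 2000 and `active_users(10_000, 30)` returns 3000, so `relative_increase(20, 30)` is 0.5: ten percentage points more concurrency means half again as many active users to support.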
Now, if you change all the users from medium to heavy, this results in a projected CPU utilization of 77 percent of a four-way box or 62 percent of a five-way box. Next, let's add in clustering, so that 100 percent of all the users are clustered with no planned scheduled replications. This changes your projected CPU usage to 81 percent of a five-way box or 68 percent of a six-way box. Next, move 25 percent of your users from Lotus Notes to HTTP/Lotus Domino Web Access. This changes your projected CPU utilization to 82 percent of an eight-way box or 74 percent of a nine-way box.
This simple example demonstrates how the resource utilization changes for the same 10,000 users by simply changing which clients, workloads, and added features the users are running. You went from a one-way to a nine-way box to support the same number of users by making some "simple" changes. While this seems rather dramatic, this type of creep is possible in your DPARs. As your users become more complex in their activity and more experienced with the features of Lotus Domino, as they change to a different protocol/client, or as you change the way you manage/administer your DPAR, your resource usage is affected.
In summary, you need to size your DPARs based on the amount of user workload and not merely on the number of registered/active users that are on a DPAR. A much smaller set of power and heavy users can consume your server's resources and lead to performance problems, while the same server can support a much larger set of light and medium users.
It is important to select a representative sample of your user population when attempting to project DPAR resources and future sizing. Although an individual DPAR may vary from your sizing projection, you must look at the total overall use of all your DPARs to understand whether or not you meet your projected sizing for your entire community.
You must monitor your DPARs to relate your actual workload and resource usage against your sizing projections. By understanding how your workloads and resource usage vary from your projections, you can better predict how long your current server can support your current environment. You can better size your next server because you have a better understanding of your users' workload and trends.
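A minimal sketch of that comparison, in Python, is shown below. It assumes you keep the projected and measured CPU use for each DPAR in the same unit (for example, percent of one engine); the DPAR names, the unit, and the function name are hypothetical.

```python
def sizing_variance(projected, actual):
    """Compare actual CPU use with the sizing projection, per DPAR and overall.

    Both arguments map a DPAR name to CPU use in the same unit. Returns
    (per_dpar_difference, total_difference); positive values mean more CPU
    is being used than was projected.
    """
    per_dpar = {name: actual[name] - projected[name] for name in projected}
    total = sum(actual[name] for name in projected) - sum(projected.values())
    return per_dpar, total
```

With a projection of 60 for each of two DPARs and measured values of 75 and 50, the per-DPAR differences are 15 and -10, and the total difference is 5. This mirrors the point above: an individual DPAR may vary widely from its projection while the community as a whole stays close to it.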