Who better to announce the significance of Server Health Monitoring than Al Zollar, Lotus General Manager, in the Opening General Session at Lotusphere 2002? Al described Tivoli's new product offering, IBM Tivoli Analyzer for Lotus Domino (which includes Server Health Monitoring), as a fundamental change to the way that Domino administrators work. This tool keeps you ahead of your user population by providing greater server up-time, more efficient utilization of your existing resources, and improved Domino server responsiveness. It also helps you reduce your total cost of ownership.
Once you put this tool in action, you can see these benefits for yourself. The good news is, you don't have to wait for Domino 6 to start using Server Health Monitoring; you can implement it now, in your R5 environment. Then, when you migrate to a Domino 6 environment, you'll already have experience with this tool and your servers will easily fit within this monitoring "network."
In this article, you'll learn how you can take advantage of the Server Health Monitoring features in an R5 or Domino 6 environment. This article is for both new and experienced Domino administrators and assumes only basic experience administering a Domino network.
What is the Domino Server Health Monitor?
Simply put, the Domino Server Health Monitor is a tool that watches your server, analyzes it, and lets you know when a facet of it needs closer attention. It's better than a visit to the doctor's office (the metaphor we usually use). Instead of one appointment, this monitoring goes on all the time (24 x 7 coverage). You, the administrator, don't even have to be present. Instead, you can do other tasks while monitoring and analysis occur in parallel. You can also be confident that the metrics are under constant surveillance, both the metrics that you typically look at and those that you don't have time to review.
Server Health Monitoring was a high-priority feature and a top goal of the development team, so in addition to supporting Domino 6 servers, Server Health Monitoring also monitors and analyzes the information supplied by Domino R5 servers. Our goal was to support the configurations that you have and deliver the best end result with what is offered in each release.
Getting Server Health Monitoring going
Your current production environment probably includes a Domino Administrator client–caliber system. You'll need a Notes 6 client on that system. The Administrator client software is needed to enable Server Health Monitoring, and that will be included in an All Client install of Notes 6. In addition, you must install IBM Tivoli Analyzer for Lotus Domino to enable Server Health Monitoring in your environment. Note that IBM Tivoli Analyzer for Lotus Domino requires a separate license.
You should also note what is not required:
- No changes are needed to your server configuration to participate in the monitoring process. This means the Domino server does not have to go through any upgrades or changes.
- There are no requirements to have any other Tivoli software products loaded. You can, however, run Server Health Monitoring and other Tivoli products together in the same environment.
Once you've installed the Administrator client, there are a few configuration options you should be aware of. Here's what you need to do to get Server Health Monitoring up and running:
- Read the Domino 6 Administrator Help section on Server Health Monitor. This documentation includes detailed information on how to configure and how to use the tool, enhancing many of the points discussed in this article.
- As part of the administrator's role in the monitoring process, you will need to have remote console access to the server. This information is specified in the Server Configuration document in the Domino Directory. This should already be configured and enabled for your server. As with Server Monitoring, you must be authorized to access the server in order to monitor it with Server Health Monitoring.
- Platform Statistics is a key component in the Server Health Monitor analysis process. It provides information about CPU, memory, and disk utilization. (With Domino 6, Platform Statistics also captures network utilization information.) We strongly recommend that you enable all of the options available in Platform Statistics. (Platform Statistics is available in Domino 5.0.2 and later for Windows NT and Sun Solaris, and in Domino 5.0.3 and later for iSeries OS/400. Domino 6 supports Platform Statistics for all platforms.) To monitor the disk counters on Windows NT, you must run a tool called diskperf, which is shipped as part of the operating system. (This feature is also available for Win32 platforms, although you may have to configure your operating system to enable it. See the Domino 6 Release Notes for more details.) The other collection metrics are available by default.
- Although not discussed in detail in this article, another set of metrics can be used by Server Health Monitoring if they are enabled. These are known as the QOS (Quality of Service) statistics. These statistics help gauge response type information for specific server tasks. Please consult the Domino 6 Administrator Help for information on configuring this set of statistics. The ServerTasks parameter in the server's Notes.ini file must also be modified to include "runjava ISpy" for the monitoring task.
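The Notes.ini change in the last item can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `add_server_task` and the sample task list are hypothetical, the code works on a string rather than on a live file, and it assumes the usual comma-separated `ServerTasks=` format. Back up the real Notes.ini before changing it, and check the Domino 6 Administrator Help for the exact task name on your release.

```python
def add_server_task(ini_text: str, task: str = "runjava ISpy") -> str:
    """Return ini_text with `task` appended to the ServerTasks line if missing.

    Hypothetical helper for illustration; it is not part of Domino.
    """
    out = []
    found = False
    for line in ini_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip().lower() == "servertasks":
            found = True
            tasks = [t.strip() for t in value.split(",") if t.strip()]
            # Compare case-insensitively so the task is never listed twice.
            if task.lower() not in (t.lower() for t in tasks):
                tasks.append(task)
            line = f"{key.strip()}={','.join(tasks)}"
        out.append(line)
    if not found:
        raise ValueError("no ServerTasks line found")
    return "\n".join(out) + "\n"


# Illustrative input only; a real server Notes.ini has many more settings.
sample = "[Notes]\nServerTasks=Router,Replica,Update,AMgr\n"
updated = add_server_task(sample)
print(updated)
```

Running the helper a second time on its own output leaves the line unchanged, which makes it safe to apply repeatedly.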
Then, to configure Server Health Monitoring, use the Domino 6 Administrator client:
- Enable the statistics gathering, analysis, and recommendation generation by choosing File - Preferences - Administration Preferences, clicking the Monitoring tab in the Administration Preferences dialog box, and selecting the "Generate server health statistics and reports" checkbox. This option is not selected by default and must be selected for Server Health Monitoring to run.
Figure 1. Monitoring tab of Admin Preferences dialog box
- To enable Historical Charting, click the Statistics tab and select the "Generate statistic reports while monitoring or charting statistics" checkbox. You should select this option so that you can perform historical analysis about a server's health over time. Information is stored locally on your client, in the Monitoring Results database (statrep*.nsf). This option is not selected by default.
Figure 2. Statistics tab of Admin Preferences dialog box
- Click OK to complete the configuration.
You enable Server Health Monitoring from the Domino 6 Administrator client by clicking the Server tab, clicking the Monitoring tab, and then clicking the Start button in the top-right corner. This starts both Server Monitoring and Server Health Monitoring.
Figure 3. Start button
When you click the Start button, it changes to a Stop button, which you can use to stop monitoring.
Selecting the servers to monitor
Server Health Monitoring is integrated into the Server Monitoring interface that R5 administrators may already be familiar with. As with Server Monitoring in R5, when you initially start up Server Health Monitoring, the list of servers to monitor is read from the Domino Directory by querying the Server Configuration documents. (In our environment in product development, the server list includes previously released Domino servers, from at least R3 up to Domino 6 servers. In this article, I'll focus on working with Domino R5 servers, but the same guidelines apply to Domino 6.)
Can you select which specific servers you want to monitor? Yes, you can! A new feature in the Administrator client lets you save a list of servers that you want to monitor in a saved statistics group profile. You create the new profile by modifying an existing profile (adding or deleting servers until you have the list of servers you want to monitor) and then saving the profile with a new name. For example, I named mine R5Servers. The list can contain Domino 6 servers only, or a mix of R5 and Domino 6 servers.
Figure 4. Saved statistics group
The profile that was last selected becomes the default profile the next time Server Monitoring is launched. These group profiles are a great way to monitor sets of servers based on your needs, whether you divide them by region, functional area, or time zone. The end result is that you can group one or more servers together for monitoring as a set, which is a practical, manageable, and scalable approach for administrators.
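Before building profiles in the Administrator client, it can help to plan group membership on paper. The following Python sketch shows one way to do that, assuming you keep a simple list of server names with a region label for each. The function name, the region labels, and the data layout are all hypothetical; the size cap of 75 is the upper end of the 50 to 75 servers per group that this article reports working well from a single client. The profiles themselves are still created and saved in the Administrator client.

```python
from collections import defaultdict

MAX_GROUP_SIZE = 75  # upper end of the per-group range reported in this article


def build_groups(servers, max_size=MAX_GROUP_SIZE):
    """Group (name, region) pairs by region, splitting oversized regions.

    Hypothetical planning helper; it does not talk to Domino.
    """
    by_region = defaultdict(list)
    for name, region in servers:
        by_region[region].append(name)

    groups = {}
    for region, names in sorted(by_region.items()):
        names.sort()
        chunks = [names[i:i + max_size] for i in range(0, len(names), max_size)]
        for n, chunk in enumerate(chunks, start=1):
            # Only number the group when a region had to be split.
            label = region if len(chunks) == 1 else f"{region}-{n}"
            groups[label] = chunk
    return groups


# Server names come from this article's examples; the regions are made up.
servers = [("Franklin", "East"), ("Houston", "South"),
           ("Traffic", "East"), ("Alice", "East")]
plan = build_groups(servers, max_size=2)  # tiny cap just to show the split
for label, members in plan.items():
    print(label, members)
```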
Off and running
When you click the Start button in the top right corner of the Server Monitoring screen, you're off and running with the Server Monitoring and Server Health Monitoring processes. Within minutes, you'll see information populating the screen:
- The Server Monitoring information appears in the main body of the screen.
- A new column for Server Health Monitoring appears on the left. The column label is Hea, for Health.
- The Hea column contains thermometer icons, which indicate the health status of the server. Green means everything is working fine. Yellow means that a component is being stressed but is not yet a problem, although it should be watched. If a red thermometer appears, one or more major components of that server need attention; a function may be working inefficiently or not at all.
Figure 5. Server Health Monitoring display
At this point, a good deal of information has already been accumulated. Server Health Monitoring gathers detailed task information. This information is also rolled into the Server Health Monitoring analysis process and is reflected in the thermometer icon.
Server Health Monitoring also builds upon the information supplied through Platform Statistics, which are supported in R5 on the Windows NT, Sun Solaris, and iSeries OS/400 platforms, and in Domino 6 on all platforms. Platform Statistics component areas supported in R5 include CPU, memory, and disk analysis. More components are supported in Domino 6 (for example, network information), and Server Health Monitoring adjusts accordingly for the different Domino releases. Server Health Monitoring analyzes these metrics against the valid, stressed, and problem ranges, and the observations are factored into the overall value reported by the same thermometer.
So key server performance metrics, which are not often easily accessible or easily understood, are being tapped into and reviewed. Key Domino statistics are also checked for their values in feature areas such as mail routing, server responsiveness, and buffering. Decisions are made on where the values fall within the observed behaviors of running efficiently versus running stressed.
Let's take a step back and really appreciate what's going on:
- Server Health Monitoring knows which metrics to select and evaluate. The tool has knowledge about what the valid ranges are for these values across Domino's supported platforms. Think of all the articles, Redbooks, and presentations that tackle this subject. Most of this knowledge has been incorporated in the knowledgebase of Server Health Monitoring. It's operating on your behalf, so you don't have to become an expert.
- Server Health Monitoring performs analysis over multiple monitoring intervals. This means the Server Health Monitor checks on these different component areas (Platform Statistics set, Server Monitoring set, and Domino set) at a regular interval (the default is one minute). Multiple monitoring intervals are reviewed before any assessments are made. Now here's a case where computer power is working to your advantage! Server Health Monitoring does not make decisions on spike conditions (single instances of a given value); it looks for a sustained or repeated pattern over multiple intervals.
- You've done nothing more than select the Start option to get the Server Health Monitor task going and reporting this information to you.
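The idea of ignoring spikes and reacting only to sustained conditions can be illustrated with a small Python sketch. It is not the tool's actual algorithm, which this article does not spell out. The class name, the 10 MB floor, and the three-interval window are assumptions chosen for the example, and the sketch only handles strictly consecutive low readings, not the broader "repeated pattern" case.

```python
from collections import deque


class SustainedCheck:
    """Flag a metric only when it stays below a floor for N consecutive samples.

    Hypothetical illustration; thresholds and window size are assumptions.
    """

    def __init__(self, floor, intervals):
        self.floor = floor
        self.window = deque(maxlen=intervals)  # keeps only the latest N samples

    def observe(self, value):
        self.window.append(value)
        window_full = len(self.window) == self.window.maxlen
        return window_full and all(v < self.floor for v in self.window)


# Available memory in MB, one sample per monitoring interval (made-up data).
samples = [120.0, 7.6, 130.0, 8.0, 7.9, 7.6]
check = SustainedCheck(floor=10.0, intervals=3)
flags = [check.observe(v) for v in samples]
print(flags)  # only the last sample completes three low readings in a row
```

The single dip to 7.6 MB at the second sample is treated as a spike and ignored; the condition is reported only once three low readings in a row have been seen.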
Diving down on server analysis
A quick look at the screen above shows red thermometers for servers Franklin, Houston, and Traffic. Let's dive down and see what the trouble is with these servers by looking at the Health Report.
From the Server Monitoring screen, right-click and choose Display Health Report:
Figure 6. Health Report view
The Current Health Reports view lists your Domino servers again, but this time, they are sorted by severity with red, critical conditions at the top; yellow, warning conditions next; and green, healthy conditions at the bottom. Here again, a lot of different information is delivered in a concise format. Let's take a closer look:
Figure 7. The server list details
Some servers also have a red icon on the right. This is a direct message to you, the administrator, that not all of the monitoring components are configured and more analysis could be performed on your behalf if these components were configured. Also, there's a Comment column (found on the far right and not shown in this illustration) which offers initial insights about the probable reason for the yellow or red alert condition. There are times when more than one component is raising the flag for concern, so an attempt is made to determine the originating component and have that listed in the comments section.
In this case, looking at the comments, you can see that memory seems to be playing a role in several of the servers. Perhaps it's time to review and upgrade system resources on these servers or to rebalance server loads. As you look at reports over time, you'll also notice if certain servers are frequently in trouble. This ability to get to know the "personalities" of your servers or to identify recurring problems is a major benefit of Server Health Monitoring. You get to know which servers are hot, which aren't, and which culprits are causing problems.
Tell me more
I'm wearing my administrator's hat, and I'm starting to put the pieces together. I'm already ahead of the game in knowing which servers to focus on, so how do I learn more? Clicking the twisties located to the left of the server name reveals more information:
Figure 8. Expanded server health report
With Domino 6, ten different components are monitored. For R5, the number is slightly reduced because not all of the same information is available. Also, keep in mind that the components that are present are the ones that are listed. In the example above, none of the sampled servers has the HTTP server component loaded and so the HTTP component is not listed. Also, Mail Delivery Latency is listed as a component, which means that the Mail Router is loaded.
Reviewing this information, I can see I need more detail about the mail routing behavior on Arista and memory usage on Franklin, as those components are listed in critical condition. Server Alice appears with only one yellow thermometer, for Memory Utilization, so that can be addressed after the more critical items are reviewed.
Be aware that one or more components may change colors, indicating new developments. Frequently, this is an added clue as to what is going on because one metric may have an impact on another.
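The relationship between the component thermometers and the server-level thermometer, together with the severity ordering used in the health report list, can be sketched as follows. This is a simplified model, not the product's logic: it assumes that a server shows the status of its worst component, and the component statuses in the sample data are illustrative values loosely based on the servers discussed above.

```python
from enum import IntEnum


class Health(IntEnum):
    GREEN = 0   # working fine
    YELLOW = 1  # stressed, should be watched
    RED = 2     # needs attention


def overall_health(components):
    """Assumed rollup: the server shows its worst component status."""
    return max(components.values(), default=Health.GREEN)


def severity_order(report):
    """Server names sorted red first, then yellow, then green; ties by name."""
    return sorted(report, key=lambda s: (-overall_health(report[s]), s))


# Illustrative data only; component names follow those used in this article.
report = {
    "Alice": {"Memory Utilization": Health.YELLOW, "CPU Utilization": Health.GREEN},
    "Arista": {"Mail Delivery Latency": Health.RED},
    "Franklin": {"Memory Utilization": Health.RED, "CPU Utilization": Health.GREEN},
}
for server in severity_order(report):
    print(server, overall_health(report[server]).name)
```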
Putting the picture together and finding solutions
There's one more level of detail you can go to for analysis, and at that level another "world" opens up, one that includes recommendations for solutions.
Server Health Monitoring generates recommendations for problems based on your unique system characteristics and specifications. The Overall Health Report for a server provides a lot of information in one place, picking out the various analysis points that have been considered. To see a server's Overall Health Report, double-click the server listed in the Current Reports view.
Figure 9. Overall Health Report
Looking over the different sections, you can see that this is a "one-stop shopping" approach, where the necessary and often related information is provided to you. This is information used to make recommendations and also information that you can use to decide your next course of action.
In this report, notice:
- Under the Configuration Issues section, there's a listing for an action that you can take moving forward (in this case, enabling Logical Disk Performance Counters). Taking this action means that Server Health Monitoring will be able to monitor another system component. In this case, not monitoring the disk activity is a source of concern, especially with increasing amounts of traffic and increasing sizes of information that is shared and distributed.
- At a higher level report, you were told that this server was experiencing challenges in having sufficient memory to execute. More detail is given here, where you are told the total amount of memory allocated on your server as well as how much memory was observed as being available.
- There have been previous recommendations as to what is considered a memory-challenged environment, and this system's available memory (7.6 MB) falls within this range. Note that this number represents a sustained condition. It was observed over multiple monitoring intervals; it doesn't represent a single observation.
- There are short-term and long-term recommended strategies. The recommendations are not listed in any specific priority order. Generally they appear in analysis order, as specific conditions are checked for each recommendation, and a recommendation is displayed only if its conditions are met (or not met). Some recommendations will have a greater impact upon your environment and configuration than others, because each usage profile and design is unique. As stated earlier, your specific configuration and specifications are taken into account when these recommendations are generated.
- The short-term recommendations are also tailored to your environment. When being told this server is low in memory, the next question is "how much should I adjust it?" That range of information is provided here, gauging how much you should adjust/grow your configuration.
- The long-term recommendations are likely to include a more involved process. The entries are all within your reach and are documented as to how to accomplish them. You will probably want to schedule time to analyze the information and decide how to execute the recommendations.
Take a step back at this moment and reflect how far you've come in such a short time period and with minimal effort. You've optimized your efforts by not having to perform this monitoring. You've been precisely guided to the areas that need more attention. And you've also been given a game plan for approaching problem situations, based on techniques that we, on the Domino Performance Team, recommend.
How are people using Server Health Monitoring in their work? I really enjoyed the scenario presented by another Domino administrator. She had positioned the Administrator client on her "flight pattern" so that at any time, she could glance at the Server Monitoring screen and know when there was an issue to dive on. She also made sure to review the Server Monitoring display at the start of her day, to get an initial forecast of what was coming or what was waiting for her to attend to.
Key points to remember
The following key points are the result of the Domino Performance Team's own deployment scenarios within development and other IBM teams, as well as from the questions and feedback received from public forums such as Lotusphere, DevCon, Admin 2001, and Tivoli-centric events:
- Server Health Monitoring will monitor Domino 6 and R5 servers on all platforms.
Monitoring processes analyze key metrics and generate recommendations when appropriate. Even if Platform Statistics are not supported on a released platform, Server Health Monitoring will collect, analyze, and work from the Domino statistics. Helpful and valuable information can still be analyzed from this set of information.
- Server Health Monitoring does not impact system resources on your servers.
Server Health Monitoring runs from and analyzes in the Domino 6 Administrator client. It does not impact server performance.
- Server Health Monitoring is part of a new feature set offering through Tivoli called IBM Tivoli Analyzer for Lotus Domino.
To enable Server Health Monitoring, you need the Domino 6 Administrator client, even in R5 environments. In addition, you must install Tivoli Analyzer for Lotus Domino. You will need to purchase a separate CD from Tivoli to enable the user interface components of Server Health Monitoring.
- Server Health Monitoring is independent of any other components or products developed by Tivoli.
Server Health Monitoring is integrated into the Domino 6 Administrator client. It does not require any other components from Tivoli (other than Analyzer for Lotus Domino). It complements other Tivoli software, if that software is installed in your environment. Server Health Monitoring focuses on analysis of individual Domino servers, while other Tivoli software will focus more on your whole environment, encompassing one or more servers.
- Server Monitoring and Server Health Monitoring are geared to monitor Domino servers in groups.
They do not necessarily monitor your whole enterprise environment, especially if it is very large. We've found that administrators can successfully monitor groups of servers with 50 to 75 servers, from a single Administrator client. This means that you can and should think in terms of how you can organize your server monitoring group setup. From a user interface viewpoint, we've seen administrators in our labs successfully monitoring one to three screens of information on the Server Monitoring display from one Administrator client.
- With Server Health Monitoring, there's less pressure on the administrator to be an expert on critical system analysis.
Server Health Monitoring is a huge relief for Domino administrators who want or need to focus on activities other than server monitoring and analysis. There's less pressure to learn the details about different operating system statistics (including CPU, memory, and disk I/O) and, with Domino 6, network statistics. Also, Domino statistics and options are reviewed on a regular basis, and through the Server Health Monitoring interface, with information as to what the values are set at. Administrators can be confident that the analysis work is being performed and performed properly.
- Server Health Monitoring can teach administrators which key points to monitor.
The Server Health Monitoring interface, starting with the Current Reports view details, can actually teach you the key points that should be monitored, the valid ranges to consider, and what your current values are. It acts as a tutorial on what to review and what to look for.
In conclusion, I hope I've presented some new perspectives and solutions for you in your environment. The great news is that you can get this effort going now and reap the benefits of Server Health Monitoring. There's also a lot more to this interface than this article describes. Server Health Monitoring's power also increases when you use it with Real-Time Charting or as part of a Historical Charting analysis effort. But we'll save those discussions for another time!
A special request
At this point, I'd like to raise a challenge to you! We can work together on making sure the Domino Server Health Monitoring knowledgebase for analysis and recommendations is as complete and accurate as possible. When using this tool, if there are problem scenarios that are not being detected by the tool, we need to know. Likewise, if there are success cases and different scenarios for usage, we also want to know. Please send us information and let us know. We'd be looking for a copy of the dommon.nsf file, where the most recent health information is stored as well as a copy of the statrep.nsf file from the Administrator client, where the statistic information is stored. Additionally, including the following switch setting within the Notes.ini file of the Administrator client will store a lot of useful additional analysis information to dommon.nsf: REDZONE_SAVE_AFTER_EVAL=1. Send this information to firstname.lastname@example.org and put "I'm a Server Health Monitor user" in the Subject line.
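For readers who prefer to script the Notes.ini switch mentioned above, here is a minimal Python sketch. It sets or adds a key=value line in the text of an ini file. The helper name and the sample file contents are hypothetical, the code treats key names as case-insensitive (an assumption), and it works on a string rather than on the live file. Close the Administrator client and keep a backup of Notes.ini before making a real change.

```python
def set_ini_value(ini_text, key, value):
    """Return ini_text with `key` set to `value`, adding the line if absent.

    Hypothetical helper for illustration; it is not part of Notes or Domino.
    """
    lines = ini_text.splitlines()
    target = key.lower()
    for i, line in enumerate(lines):
        name, sep, _ = line.partition("=")
        if sep and name.strip().lower() == target:
            lines[i] = f"{key}={value}"  # replace the existing setting
            break
    else:
        lines.append(f"{key}={value}")   # setting was not present yet
    return "\n".join(lines) + "\n"


# Illustrative contents only; a real client Notes.ini has many more settings.
client_ini = "[Notes]\nKitType=1\n"
patched = set_ini_value(client_ini, "REDZONE_SAVE_AFTER_EVAL", "1")
print(patched)
```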
And special thanks
Special thanks to Lynda Urgotis, our User Assistance Specialist, for all her help, support, and enthusiasm around every writing project she managed and delivered, including her assistance with this article. She's able to make every project that she gets involved in a success.