Client and server crashes are something that everyone wants to avoid, but when they occur, your top priority should be to obtain the data needed to prevent a recurrence and to recover quickly. This article discusses some of the new serviceability features added to IBM Lotus Notes/Domino 7. The article is intended for Domino administrators who want to obtain the proper crash-related data to send to IBM Support and users who are curious to know what happens when their Notes client terminates abruptly.
Some background on fault recovery and automatic diagnostic collection
Fault recovery has been around on the Lotus Domino for UNIX platforms since release 4, but it was not documented and was somewhat difficult to work with. As of Lotus Notes/Domino 6, fault recovery works on all the platforms on which Lotus Notes/Domino runs and is easily configurable in the Server document. Fault recovery tracks all the resources that are in use by the client or server and, in the event of a crash, cleans them all up. For the Domino server, there is the optional feature to restart it after the diagnostic information has been collected. For users, this means that they can restart the Notes client after a crash without having to terminate all the processes that had been started by the Notes client.
With fault recovery, Lotus Notes/Domino can gather the proper diagnostic information on every crash. However, you need access to each machine on which a crash occurs to retrieve this information, which can be problematic for Notes client crashes. The automatic diagnostic collection (ADC) feature was added in Lotus Notes/Domino 6.0.1 to remove the requirement to gather the data from each machine. ADC enables you to set up a mail-in database to collect the diagnostic information generated from the Notes client and/or the Domino server crashes in one central repository.
You can configure ADC through Server Configuration documents (for servers) and Desktop Setting policies (for clients). The data is sent to the mail-in database when the client or server restarts. For client crashes, you can configure the Policy document to control whether users can choose to send in the diagnostic information and whether they can enter comments about what they were doing prior to the crash.
New features added to ADC in Lotus Notes/Domino 7
The ADC functionality has been extended in Lotus Notes/Domino 7. In this section, you learn about the new functionality and how it can help you collect better diagnostic information and also spot trends in outages over time. The new features are as follows:
- Configuration of files to collect at the time of a crash
- Extra information collected for Sametime and QuickPlace crashes
- Server restart notification to Domino Domain Monitoring (DDM)
- Extended logging choices
- Expanded information in email notifications
- Limiting attachment size
- Warning if NSD is not configured to be run
- Fault analyzer
Configuration of files to collect
In Lotus Notes/Domino 6.x, ADC collects a hard-coded list of files that includes the NSD output, console log, and memory dump. This is usually enough information to debug standard crashes, but what if there are third-party applications running on the client or server that have their own log files? Or suppose you want to capture the SEMDEBUG.TXT file because there was contention on a semaphore that led to the outage?
Lotus Notes/Domino 7 expands the number of files captured by default, and you have the option to collect additional files in the event of a crash. The files collected by default are:
- NSD output. A dump of system information and what each thread in all Notes/Domino processes was doing at the time of a crash.
- Console output. A text file containing messages that were written to the Domino console.
- Memory dumps. A snapshot of which memory blocks were allocated by Notes/Domino processes at a given time.
- Notes_Child_PID output. The Process ID of each process when it is created along with its exit status when it terminates.
- Memcheck errors. Any errors detected by the memcheck program when scanning the Notes/Domino memory pools.
- Semaphore debug. A text file showing the semaphores that were contended for.
- HTTP session/thread logs. Detailed information about what each HTTP session and thread were doing. This is collected only if the crash occurs on HTTP and session/thread logging is enabled.
- NSD-sysinfo output. The system information that is recorded at client/server startup time.
- NSD-kill output. The information collected when nsd -kill is used to terminate the client or server.
In addition to these files, you can also collect any other files that may be present at the time of a crash by completing the information in the Server Configuration document or the Desktop Settings Policy document. If you click the Diagnostics tab on either of these documents, you see a section like that shown in figure 1, which uses the Domino 7 Directory design.
Figure 1. Diagnostic Collection Options section
In the Diagnostic file patterns field, you can enter a list of file names and/or patterns to search for at the time of a crash and attach the information to the document that is sent to the mail-in database, provided the configured size limit is not exceeded (see the "Limiting attachment sizes" section later in this article for information on size limits). The patterns can contain an asterisk ( * ) to signify one or more characters or a question mark ( ? ) to signify one character.
If the file resides in the current diagnostic directory (usually <data directory>/IBM_TECHNICAL_SUPPORT), then you only need the file name. If, however, the file resides outside this directory, you need a full path to the file.
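The pattern and path rules above can be sketched in a few lines of Python. This is only an illustration of the rules as described in this article, not ADC's actual implementation: the function names are hypothetical, and the pattern semantics (an asterisk matching one or more characters, a question mark matching exactly one) are taken from the description above.

```python
import re
from pathlib import Path


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a pattern as described in the article into a regular expression.

    '*' stands for one or more characters, '?' for exactly one character.
    Everything else is matched literally.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".+")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def find_diagnostic_files(entry: str, diag_dir: Path) -> list[Path]:
    """Return the files matching one entry of the file patterns field.

    A bare file name or pattern is looked up in the diagnostic directory;
    an entry that carries a path is looked up in that path instead.
    """
    candidate = Path(entry)
    directory = diag_dir if candidate.parent == Path(".") else candidate.parent
    if not directory.is_dir():
        return []
    regex = pattern_to_regex(candidate.name)
    return sorted(
        p for p in directory.iterdir() if p.is_file() and regex.fullmatch(p.name)
    )
```

For example, `find_diagnostic_files("SEMDEBUG.TXT", diag_dir)` looks only in the diagnostic directory, while an entry with a full path is resolved against its own directory.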
Extra information for Sametime/QuickPlace crashes
If a crash is in either the Sametime or QuickPlace server, additional information is gathered by default and added to the document sent to the mail-in database. Specifically, if the crash is in Sametime, the stdiags_* file is automatically added to the mail-in document. If it is a QuickPlace crash, Lotus Domino attempts to attach the qpconfig.xml and admin.nsf files to help diagnose the problem.
Server restart information posted to Domino Domain Monitoring
Domino Domain Monitoring (DDM) is a new feature in Lotus Domino 7 that provides administrators a high-level view of how their entire domain is performing. When a server restarts after a crash, a message is posted to DDM, so you can track the event at the administrator console. It also provides a historical tracking of these events across the domain.
Extended logging choices
In Lotus Notes/Domino 6.x, if console logging is not enabled on the client or server, then no console information is captured in the mail-in database. In certain cases, however, this information is necessary to track down the cause of a problem, so Lotus Notes/Domino 7 makes a greater effort to obtain the information.
Regardless of whether or not console logging is enabled, ADC first checks whether the server runs under the Domino Controller (also known as the Java Controller). If it does, ADC uses the Domino Controller logs (still respecting the size limits in the Configuration document), because they are a superset of the information in the console output; that is, the logs contain the messages plus severity codes for each message.
If ADC runs on the client, or the server is not running under the Domino Controller, ADC checks whether or not console logging is enabled. If so, it uses the console output just as it does in Lotus Domino 6.x.
If console logging is not enabled, ADC falls back to using the information written into log.nsf (which is a subset of the information contained in the console log). It does this by extracting the text from the time of the crash backward from the Miscellaneous Events view in log.nsf and creating a pseudo-console output file. This file is then attached to the document just as the console log would have been. The reason the entire log.nsf is not attached to the document is that, in most cases, it would be larger than the configured size limit, and because log.nsf is a binary file, you cannot simply truncate it and send in the last part.
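The fallback order described in this section amounts to a short decision chain. The following Python sketch restates it; the function name and the returned labels are hypothetical and exist only for illustration.

```python
def choose_console_source(
    is_server: bool, under_controller: bool, console_logging: bool
) -> str:
    """Pick the source of console information, in the order the article describes."""
    # 1. A server running under the Domino Controller: use the Controller logs,
    #    whether or not console logging is enabled.
    if is_server and under_controller:
        return "controller_logs"
    # 2. Client, or server without the Controller: use the console log if enabled.
    if console_logging:
        return "console_log"
    # 3. Otherwise build a pseudo-console file from the Miscellaneous Events
    #    view in log.nsf.
    return "log_nsf_misc_events"
```

Note that the Controller logs win even when console logging is on, because they contain the console messages plus severity codes.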
Expanded information in email notifications
In Lotus Domino 6.x, there is a configuration option to send a notification to a person or a group when the server is restarted. However, this notification provides only basic information, such as when the server crashed and the fact that it is now back up (see figure 2).
Figure 2. Domino 6 server restart notification
In Lotus Notes/Domino 7, this notification has been enhanced to contain more information in both the Subject and Body fields, so those on the notification list can get more information about the crash more quickly. As you can see in figure 3, the Subject field includes the Domino version, the process that crashed, and the time of the crash. The Body field contains nearly all the information that was sent to the mail-in database (except for the attachments) along with a database link to the mail-in database (if it resides in the same domain as the server that crashed).
Figure 3. Domino 7 server restart notification
With this expanded Subject line, users of handheld devices or pagers who are on the notification list can get the information they need without having to go to the mail-in database. If you prefer the 6.x format, set ADC_USE_OLD_EMAIL_FORMAT=1 in the Notes.ini file of each server that should send the 6.x-style messages.
The Notes.ini parameter FR_ATTACH_NSD is still respected in Lotus Domino 7. This parameter determines whether or not the NSD output file is attached to the notification message that is sent when a server restarts. If the parameter is set to 1, then the NSD output is attached to the notification message. If it is a value greater than 1, then the NSD output is attached only if it is less than or equal to the parameter's value in kilobytes (thus FR_ATTACH_NSD=5 means attach it only if it's less than or equal to 5 KB). This setting applies only to attaching the NSD output to the notification message and has no bearing on what is attached to the message sent to the mail-in database.
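The FR_ATTACH_NSD rules can be written out as a small predicate. This Python sketch makes two assumptions that the article does not state: that a missing or zero value means the NSD output is not attached, and that a kilobyte is 1024 bytes. The function name is hypothetical.

```python
def should_attach_nsd(fr_attach_nsd: int, nsd_size_bytes: int) -> bool:
    """Decide whether the NSD output goes on the restart notification message."""
    if fr_attach_nsd <= 0:
        # Assumption: unset or 0 means "do not attach".
        return False
    if fr_attach_nsd == 1:
        # 1 means "always attach".
        return True
    # Any larger value is a size limit in kilobytes (assumed 1 KB = 1024 bytes).
    return nsd_size_bytes <= fr_attach_nsd * 1024
```

With FR_ATTACH_NSD=5, a 5 KB NSD file is attached and a 6 KB file is not.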
Limiting attachment sizes
One of the problems with adding attachments to the message that ADC sends to the mail-in database is that it greatly increases the size of the message. This can cause problems if there is not enough disk space to transfer the message or if your company uses router controls to limit the size of messages sent in the environment.
In Lotus Notes/Domino 7, the Diagnostic Collection Options section (see figure 4) lets you limit both the total size of the message that is sent to the mail-in database and how much of the NSD output to attach. The NSD limit set here is not related to the FR_ATTACH_NSD parameter mentioned in the last section.
Figure 4. Maximum size fields in the Diagnostic Collection Options section
Figure 4 shows the defaults of 5 MB for the total message size and 2 MB worth of NSD output, which is read starting from the beginning of the file, so it's the first 2 MB of the file that is used.
When ADC is building the message, it attaches files in this order:
- NSD output (up to configured size)
- Console output (up to configured size)
- Diagindex.nbf file, which contains a list of all diagnostic files generated since the server last restarted
- All other files, including other default files and files you configured to be collected
If the NSD file needs to be truncated or if any of the files cannot be attached to the message because they would exceed the limit, they are listed on the mail-in document so you can retrieve them manually. Figure 5 shows an example from a server crash.
Figure 5. Example ADC message for a server crash
For client crashes, you get the additional information about which Desktop Setting policy was in effect (see figure 6).
Figure 6. Example ADC message for a client crash
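The attachment order and size limits described in this section can be sketched as a planning function. This Python example is an illustration under assumptions, not ADC's code: the names are hypothetical, files are represented as (kind, name, size) tuples, and the console output is treated as a whole file that either fits or does not, which simplifies the "up to configured size" behavior.

```python
MB = 1024 * 1024

# Attachment order from the article: NSD, console, diagindex, then the rest.
ATTACH_ORDER = {"nsd": 0, "console": 1, "diagindex": 2, "other": 3}


def plan_attachments(files, max_total=5 * MB, max_nsd=2 * MB):
    """Return (attached, listed_only) for a list of (kind, name, size) tuples.

    attached    -- (name, bytes_attached) in the order they are added
    listed_only -- names that were truncated or skipped and must be
                   retrieved manually from the client or server
    """
    attached, listed_only = [], []
    used = 0
    # sorted() is stable, so files of the same kind keep their input order.
    for kind, name, size in sorted(files, key=lambda f: ATTACH_ORDER[f[0]]):
        # Only the first max_nsd bytes of the NSD output are used.
        portion = min(size, max_nsd) if kind == "nsd" else size
        if used + portion > max_total:
            listed_only.append(name)
            continue
        attached.append((name, portion))
        used += portion
        if portion < size:
            listed_only.append(name)  # truncated: full file is still on disk
    return attached, listed_only
```

With the defaults, a 3 MB NSD file contributes its first 2 MB and is also listed for manual retrieval, and a 4 MB third-party log that would push the message past 5 MB is listed but not attached.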
Warning when NSD is not configured
ADC relies on the NSD output for much of its information, so if NSD is not configured to run at the time of a crash, there is little useful information in the ADC report. Specifically, there is no information for IBM Support to use to help resolve your issue. To alert administrators of this, the following warning displays in the mail-in database if no NSD output was found at the time of the crash:
WARNING: Crash information not extracted! Make sure NSD is configured to run on the Server document.
Fault analyzer
The largest new feature for ADC in Lotus Domino 7 is the fault analyzer. The fault analyzer is a new Domino server add-in task that looks for matches among the crash reports in the mail-in database to which ADC sends its information.
Fault analyzer can handle many mail-in databases (for example, if you have different databases for client and server crashes), but it looks for matches only within the database that it is currently processing; it does not look across multiple databases for a match.
If fault analyzer does not find a match for a new crash, it is designated a parent crash (meaning this is the first time this crash has been seen). If it finds a match, then it is designated as a child crash, and the first occurrence of this crash becomes its parent. The parent/child relationship is discussed later in this article, but for now, let's discuss the process used to determine whether two crashes match.
Determining whether two Notes/Domino crashes result from the same root problem is not always an easy task. Because of the layered architecture of Lotus Notes/Domino, you cannot rely on the error message alone to uniquely identify a problem because there could be different code paths leading to the same error message.
Fault analyzer uses the sequence of functions called by the thread that crashed to develop a unique signature for the crash. This sequence of function calls shows you the path to the crash.
If you've looked through NSD output files on various Notes/Domino platforms, you've seen that the various operating systems display the call stack information quite differently. ADC takes care of normalizing these differences before sending the report to the mail-in database. This includes making sure functions are listed in reverse chronological order and that C++ functions are de-mangled (meaning they are in <class>::<function> format).
Fault analyzer starts its matching process by reading from the top of the stack downward until it comes across a function name that doesn't match between two stacks. The top of the stack is determined by the Panic function, which is the first function on the stack for access violations on Windows 32 platforms or the first function after fatal_error on UNIX platforms.
Fault analyzer walks through each function in stack1 and looks for a match at the same position in stack2. This process is repeated until the function names do not match. When this happens, a percentage match is determined between the two stacks. The matching percentage is determined by the number of functions the stacks have in common divided by the average number of functions in the stacks.
If stack1 has 10 functions and stack2 has 15 functions and the first five functions match between the two stacks, the matching percentage is 5 / ((10+15) / 2) = 5 / 12.5 = 40 percent.
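The calculation can be expressed as a short Python function. This is a minimal sketch of the formula as described here, assuming each stack is a list of normalized function names ordered from the top of the stack downward; the function name and the placeholder function names in the example are hypothetical.

```python
def match_percentage(stack1, stack2):
    """Percentage match between two call stacks, read from the top downward."""
    common = 0
    for f1, f2 in zip(stack1, stack2):
        if f1 != f2:
            break  # stop at the first position where the names differ
        common += 1
    average_length = (len(stack1) + len(stack2)) / 2
    if average_length == 0:
        return 0.0  # two empty stacks: nothing to compare
    return 100.0 * common / average_length


# Five shared functions at the top, then the stacks diverge.
shared = ["f1", "f2", "f3", "f4", "f5"]
stack1 = shared + [f"a{i}" for i in range(5)]    # 10 functions
stack2 = shared + [f"b{i}" for i in range(10)]   # 15 functions
print(match_percentage(stack1, stack2))  # 40.0
```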
If two stacks have all functions in common, then their matching percentage is 100 percent, and they are called an exact match. It's not difficult to determine that these two crashes have the same root cause. However, it is possible for two crashes to have the same root cause, but not have identical call stacks. This can occur because Lotus Domino is written in a layered model whereby various subsystems build upon one another. If there is a problem in the layer that accesses the databases (NSFs), then many other subsystems (for example, router, replicator, and so on) may all hit the same root problem, but have different call stacks leading to the problem.
To determine whether or not two stacks are a partial match, we use a percentage cut-off based on the average stack length of the two call stacks. The average stack length is determined by the number of functions in stack1 plus the number of functions in stack2 divided by 2. The cut-off percentages based on average call stack length are shown in the following table.
|Average call stack length|Cut-off percentage|
|< 4|Must be an exact match. With so few functions in the stack, it is too risky to call it a partial match if only a subset of the functions are the same.|
|4|Must have a percentage match of 75 percent or higher.|
|5 to 7|Must have a percentage match of 60 percent or higher.|
|8 or more|Must have a percentage match of 40 percent or higher.|
As the average stack length increases, the percentage needed to be deemed a partial match decreases. In longer stacks, even a lower percentage represents a substantial number of functions in common at the top of the stack, so you can feel confident that the problems are the same.
You can manually override this algorithm by setting the parameter FAULT_ANALYZER_MATCH_PERCENTAGE in the Notes.ini file on the server that runs the fault analyzer. You can set this to a number between 1 and 99. If you use this setting, it applies to all partial matches, regardless of the average stack length.
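Putting the cut-off table and the override together, the decision can be sketched as follows. This Python example is illustrative only. The function names are hypothetical, and because the table does not say how fractional averages such as 4.5 or 7.5 are handled, the sketch assumes they fall into the lower bracket.

```python
def cutoff_for(average_length, override=None):
    """Cut-off percentage for a partial match.

    override models FAULT_ANALYZER_MATCH_PERCENTAGE (1 to 99), which applies
    regardless of the average stack length.
    """
    if override is not None:
        if not 1 <= override <= 99:
            raise ValueError("override must be between 1 and 99")
        return float(override)
    if average_length < 4:
        return 100.0  # must be an exact match
    if average_length < 5:   # assumption: 4.5 is treated like 4
        return 75.0
    if average_length < 8:   # assumption: 7.5 is treated like 5 to 7
        return 60.0
    return 40.0


def classify(match_pct, average_length, override=None):
    """Classify a comparison as 'exact', 'partial', or 'no match'."""
    if match_pct >= 100.0:
        return "exact"
    if match_pct >= cutoff_for(average_length, override):
        return "partial"
    return "no match"
```

For instance, `classify(40.0, 12.5)` yields a partial match, while `classify(50.0, 4.0)` does not reach the 75 percent cut-off.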
Fault analyzer configuration
Fault analyzer is turned off by default. You need a Domino 7 server to host your mail-in databases in order for fault analyzer to run on them. Also, the Domino Directory must be upgraded to the release 7 design so you can see the new fields on the Server Configuration document to configure fault analyzer on the Diagnostics tab (see figure 7).
Figure 7. Fault analyzer fields in the Server Configuration document
By default, if you enable fault analyzer, it runs on all mail-in databases on a server and is launched during server startup (there's no need to add it to the ServerTasks line in the Notes.ini file). Fault analyzer determines which databases on a server are ADC mail-in databases by scanning the local Domino Directory for Server Configuration and Desktop Setting Policy documents that have mail-in databases defined. If any of those databases are on the local server, they are monitored. If no databases are found on the local server, then fault analyzer shuts down.
When fault analyzer starts up, it processes any new crashes in the mail-in databases, and then remains idle until a new crash is delivered. At that point, it immediately runs and processes the new crash.
You can also configure fault analyzer to run on selected databases (see figure 8). You may choose this option if you want fault analyzer to run on only a subset of the databases on the server or if you want to replicate all your mail-in databases to one server, so they can be processed in one place instead of running fault analyzer in multiple places.
Figure 8. Fault analyzer set to run on selected databases
You can also run fault analyzer manually by issuing the console command load faultanalyzer. If you do not pass in any arguments, fault analyzer uses the method described previously to find all the ADC mail-in databases on the current server. You can also pass in a parameter that specifies a database, directory, or indirect file (*.IND that contains a list of databases) for fault analyzer to process. When fault analyzer is loaded manually, it processes any new crashes in the databases, and then terminates; it does not remain idle waiting for the next crash to be delivered.
To save space in the mail-in database, you can also use an option to remove attachments from child crashes (both exact and partial). The idea is that you already know it's the same crash, so there's no need to store the diagnostic information twice. The data is still available on the client or server if you choose to remove it from the mail-in database.
NOTE: If you run ADC in Lotus Domino 6.x, when you upgrade to Lotus Domino 7 and enable fault analyzer, it processes all your current entries in the mail-in database the first time it runs.
Changes to Notes/Domino Fault Reports database
The Notes/Domino Fault Reports (lndfr.ntf) template has been updated for the Domino 7 server so that you can view all the new information that is present in the ADC mail-in database. Current mail-in databases are upgraded to the new design when you upgrade the server to release 7 and the design task runs.
Figure 9 shows the new outline of the database. Standard and Fault Analyzed views have been added to the By Date, Clients, and Servers views so that you can see the parent/child relationships as well as the current by date categorizations.
Figure 9. Outline of the Domino 7 Fault Report database
In the new Fault Analyzed views, the child is categorized under its corresponding parent crash. When a parent crash document is opened, there is an embedded view at the top that shows any children that are associated with it. Double-clicking opens those documents directly from the embedded view. Exact and partial matches contain doclinks to the parent crash, so you can navigate using those as well. In addition, the partial matches display the matching percentage, so you can see how closely the two stacks match.
The occurrence count for each parent crash is the total number of times the crash has been experienced in the current mail-in database. This count is listed in the Administrative Section of the parent crash document (see figure 10) and is also displayed in all the views in the mail-in database.
Figure 10. Occurrences field in the Administrative Section of parent crash document
Unique ID count
In addition to the total occurrence count, fault analyzer also tracks the number of unique clients and/or servers that have been affected by a particular crash. This can be helpful in determining weighting for a problem. For example, if you experience a crash 20 times by 20 different users, that may be more serious than 20 crashes by the same user. The unique IDs are listed in the Administrative Section of the parent crash document in descending order according to the number of times they've experienced the problem. You can also see this count in the views in the mail-in database.
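The two counts can be illustrated with a few lines of Python. This sketch assumes the input is simply the list of client or server names that reported a given crash, one entry per report, with the parent report included; the function name and the sample names are hypothetical.

```python
from collections import Counter


def summarize_occurrences(reporters):
    """Summarize who hit a crash, given one name per crash report."""
    counts = Counter(reporters)
    return {
        "occurrences": sum(counts.values()),  # total number of reports
        "unique_ids": len(counts),            # distinct clients/servers
        "by_id": counts.most_common(),        # descending by report count
    }


summary = summarize_occurrences(["alice", "bob", "alice", "srv1", "alice", "bob"])
print(summary["occurrences"], summary["unique_ids"])  # 6 3
print(summary["by_id"])  # [('alice', 3), ('bob', 2), ('srv1', 1)]
```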
After you've determined the root cause of a crash and you've upgraded to a release containing the fix or applied a corrective hotfix, you do not want any future crashes with the same stack to be marked as children (because the crash should have been resolved).
To have future matching call stacks treated as new parents, you can mark a crash resolved in the Administrative Section of the parent crash document. Resolved crashes display with a green checkmark next to them in the views.
When marking the crash resolved, you can either indicate that all clients/servers have the fix, or you can enter a list of Notes/Domino versions and/or hotfixes that contain the fix (see figure 11). Fault analyzer assumes that, if a problem is fixed in a release, then it is fixed in subsequent releases (this applies to Notes/Domino releases and hotfixes, that is, combo hotfixes). Fault analyzer also takes into account the relationship between releases 6.x and 6.5.x; thus, if something is marked as fixed in 6.0.4, it is considered fixed in 6.5.2 as well.
Figure 11. Editing a parent crash to mark it resolved
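The "fixed in a release means fixed in subsequent releases" rule can be sketched as a version comparison. This Python example is deliberately limited: the function names are hypothetical, hotfixes are not modeled, and because this article gives only one example of the 6.x to 6.5.x relationship, the sketch returns None (unknown) for that case rather than guessing at the mapping.

```python
def parse_release(text):
    """Turn a release string such as '6.0.4' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def is_fixed_in(running, fixed_in):
    """Does the running release contain a fix first shipped in fixed_in?

    Returns True or False when the answer follows from release order, and
    None when the releases are in different families of the same major
    version (for example 6.0.x versus 6.5.x), which this sketch does not model.
    """
    r, f = parse_release(running), parse_release(fixed_in)
    if r[0] != f[0]:
        return r[0] > f[0]   # a later major release is a subsequent release
    if r[:2] != f[:2]:
        return None          # cross-family mapping is not modeled here
    return r >= f            # same family: compare release numbers
```

A crash reported from 6.0.5 against a fix marked in 6.0.4 would therefore be treated as a new parent, while one from 6.0.3 could still be matched as a child.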
You've learned about some of the enhanced serviceability features added to Lotus Notes/Domino 7 that enable you to better handle client and server crashes. Now is the time to upgrade to Lotus Notes/Domino 7 and to look to the horizon for the next release of IBM Lotus Notes, code-named Hannover!