Hoping we don't tempt fate with our timing, on Friday the 13th of this month we quietly turned on an automated fault analysis capability for Notes System Diagnostics (NSD) files uploaded to our Technical Support file repository, called ECUREP (for Enhanced CUstomer REPository). In other words, any time an entitled customer uploads an NSD file, our systems automatically, without delay, analyze it to determine what type of incident is reflected (crash, hang, out-of-memory condition, user-killed processes, etc.) and, for a crash, whether the crash stack contained in the NSD file matches any known problems in our database. When a customer encounters a problem already seen and solved elsewhere, the system can point to the known defect and the associated technote. When the crash stack in the uploaded NSD file does not match any known problems, no result is returned, but per standard Support process a new defect is opened to track the further analysis. The Support Engineer, along with Development, may then manually apply other internal tools, such as MemCheck or Laza, to analyze the incident.

Fault Analyzer has shipped with the Domino product since version 7.0 to process data captured with the Automated Data Collection (ADC) feature. Local analysis at the customer site can determine the general disposition, or incident type, but it cannot match the crash stack against our in-house database of known issues. That database comprises all NSD submissions to the ECUREP system, plus similar data captured in IBM's internal worldwide environment of over 400,000 employees.
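Conceptually, the automated step amounts to a classify-then-match pipeline. The sketch below is only an illustration of that flow under stated assumptions; every name in it (analyze_nsd, KNOWN_ISSUES, the Match record, the sample stack frames and defect identifier) is hypothetical and does not reflect the actual ECUREP implementation.

    from dataclasses import dataclass
    from typing import Dict, Optional, Tuple

    @dataclass
    class Match:
        defect_id: str      # known defect identifier (hypothetical)
        technote_url: str   # associated technote (placeholder URL)

    # Hypothetical known-issue database: crash-stack signature -> match record.
    KNOWN_ISSUES: Dict[Tuple[str, ...], Match] = {
        ("frame_a", "frame_b", "frame_c"):
            Match("DEFECT-0001", "https://example.com/technote"),
    }

    def analyze_nsd(incident_type: str, crash_stack: Tuple[str, ...]) -> Optional[Match]:
        """Determine disposition; for crashes, look up the stack signature."""
        if incident_type != "crash":
            return None        # hang, out-of-memory, user-killed, etc.: no stack match
        match = KNOWN_ISSUES.get(crash_stack)
        if match is None:
            # No known-issue hit: per standard Support process, a new defect
            # would be opened here to track the further manual analysis.
            return None
        return match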
The new automated support analysis leverages the same Fault Analyzer tool available in Domino itself, and runs against our latest database of known issues. It can handle compressed archive files in zip, tar, tar.gz, tar.bz2, ar, jar, dump and cpio formats, up to 225 MB in size. There is no logical limitation behind the 225 MB cutoff; it is a cautionary, self-imposed limit we have set to avoid slowing down related processes. Once we get a sense of how the analysis system operates, we may alter the limit.

The system offers several key advantages. From a customer perspective, a first possible answer is returned much faster in cases where the crash stack signature is known. From a vendor perspective, it gives our engineers a quick first analysis of the diagnostic data. The system 'stamps' the information into the Problem Management Record (PMR) visible to the entitled customer via the Service Request tool on the Web, which helps keep all information related to the customer's issue in one central thread visible to both the customer and the support representative.
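To make the intake rule concrete, here is a minimal sketch of such a gate, assuming a simple suffix-and-size check; accept_upload, ACCEPTED_SUFFIXES and MAX_BYTES are illustrative names, not the actual repository code.

    import os

    # Archive formats the automated analysis accepts, per the list above.
    ACCEPTED_SUFFIXES = (".zip", ".tar", ".tar.gz", ".tar.bz2",
                         ".ar", ".jar", ".dump", ".cpio")

    # Cautionary, self-imposed cap of 225 MB; may be altered as we learn.
    MAX_BYTES = 225 * 1024 * 1024

    def accept_upload(path: str) -> bool:
        """Return True if the uploaded archive qualifies for automated analysis."""
        if not path.lower().endswith(ACCEPTED_SUFFIXES):
            return False
        return os.path.getsize(path) <= MAX_BYTES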
Given that we have just launched this automated use of the Fault Analyzer in our support process, we fully expect to find opportunities to tune and improve the process as we learn from initial submissions and analyses. A key design concern has been, and continues to be, minimizing false positives. Returning an incorrect defect match could waste time for both our customer and our support representative, so to start we have set match criteria that we believe are specific enough to minimize false positives, and we continue to review and tune the algorithms. As we learn from the initial submissions, we will look for ways to refine the match criteria so that more submissions find a match, but only where we can assure ourselves the identification can be made with sufficient accuracy.

Experience from the first couple of weeks shows that fewer than half of the NSD submissions find a match. However, finding matches for 100% of submissions is not our success criterion for automated fault analysis. New problems, e.g. from interaction with newly released third-party components, will obviously have no match the first time any customer submits them. If we were able to find matches for all submissions, it would mean that all problems were known, and that in turn would mean either that we were terribly lagging in delivering maintenance releases, or that our customers were terribly back-level in applying that maintenance. So to improve fault-match identification, we are not focused on achieving matches in a specific percentage of cases, but rather on identifying the additional circumstances (crash stack specifics) that allow us to positively match against additional known issues, and on extending our logic to cover those circumstances as well.
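As one way to picture the trade-off, a deliberately strict criterion might demand an exact match on the top few stack frames before claiming a known-issue hit; loosening the threshold yields more matches at a higher false-positive risk. The sketch below is purely illustrative (TOP_FRAMES and stacks_match are assumed names; the real match criteria are internal).

    # Hypothetical threshold: how many top-of-stack frames must match exactly.
    TOP_FRAMES = 5

    def stacks_match(candidate: list, known: list, top_n: int = TOP_FRAMES) -> bool:
        """Conservative comparison: prefer a miss over a wrong match."""
        if len(candidate) < top_n or len(known) < top_n:
            return False
        return candidate[:top_n] == known[:top_n]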
I hope you agree that with the new automated Fault Analysis, we have taken yet one more step to provide more efficient support to our customer base. Crashes, hangs and resource exhaustion should be rare events, but when they do occur, rapid problem identification is key to minimizing business impact.