IBM Support

IBM Power FAQ: What to expect when requesting Root Cause Analysis (RCA) (applies to AIX, HMC, IBM i, PowerHA, PowerVM, Power Systems)

Question & Answer


Question

Root cause analysis (RCA) is an assessment that determines the sequence of events that have contributed, or can contribute, to the occurrence of issues. The main contributors to issues include simple user actions (for example, invalid command usage or incorrect configuration), lack of planning and testing (for example, for new implementations or changes to an implementation), configuration of products or devices outside of the operating system environments, custom client configuration in conjunction with the client's unique business workload, and one or more complex microsecond events in a mix of millions of events across multiple devices and systems.
This document answers commonly asked questions to help clients set the correct expectations when working with IBM Remote Technical Support on RCA requests. IBM strongly recommends that client administrators and management thoroughly review this document and the referenced materials to fully understand the scope of support, the definitions of business-impacting events, and how requests for RCA are handled.


Answer

IBM Remote Technical Support is staffed and trained to provide quality and timely technical support on IBM's most current products and release levels. 
Details about IBM's "End of Service" policies are available on the IBM Support Lifecycle Policy FAQ page. Current product lifecycle details ("In Service" and "End of Service" dates) can be viewed on these product websites: 

Question: What is the difference between requesting a problem solution (or resolution) and requesting RCA?

Answer:
A request for a problem solution (or resolution) is often confused with a request for RCA, which leads to incorrect expectations and delays in working toward a problem solution. Unfortunately, many clients tend to use the incorrect term "RCA" instead of the correct term "resolution".
The task of working toward a problem solution includes the steps necessary to resolve the reported issue or bring stability to the affected environment.
RCA, as previously explained, is an assessment performed after the systems and applications are restored to working operation, to understand the sequence of events that led to the issue.
The primary goal for IBM Remote Technical Support is to provide a problem solution and, if a solution is not available, a valid workaround to the reported problem.
When opening a case with IBM, it is strongly recommended not to use the phrase RCA as a substitute for asking for a resolution to the reported problem.

Question: Is RCA provided after business hours and on weekends?

Answer:

In today's global environment, the following factors have to be considered when providing support:
  1. The country where the support contract was created.
  2. The country where the machine is located.
  3. The countries where the system administrators work.
For IBM, the serviceability of software maintenance agreements (or contracts) is tied to option 1, the country where the support contract was created. Therefore, business hours (also known as daytime hours) and weekend hours are specific to the country (and the location within that country) where the contract was created.
The priority of IBM Remote Technical Support is to help customers identify and resolve defects and restore systems to an operational state. When clients open cases outside daytime business hours (specific to the country where the support contract was created for the machines), the focus will be on resolving critical business issues. Once the critical business issue has been resolved or the systems are operational again, requests for RCA may be postponed until the next weekday (for example, Monday through Friday) during business hours (also known as daytime hours).

Question: Is RCA included with standard software maintenance (also known as, support) agreements?

Answer:
The IBM software maintenance agreements do not explicitly state that RCA is included within scope of supported activities. As a reference, we encourage client administrators and management to review the IBM published documentation that explains the guidelines of what support topics are and are not within the scope of IBM Remote Technical Support, how case severity can be applied to situations, and how cases will be worked based on the "current" business and operational impact to the environment.
"How technical questions (Q&A) are handled"
https://www.ibm.com/support/pages/node/796206
"IBM Support General Guidelines and Limitations"
https://www.ibm.com/support/pages/node/740855
"IBM Enterprise Support and Preferred Care options for System Storage and Power Systems"
https://www.ibm.com/support/pages/node/738889
"IBM Enterprise Support and Preferred Care Severity Definitions"
https://www.ibm.com/support/pages/node/739151

Question: Is RCA considered a severity 1 and/or system down situation?

Answer:
Per the definitions provided by the IBM Support Guide:
"IBM Enterprise Support and Preferred Care Severity Definitions"
https://www.ibm.com/support/pages/node/739151
Severity 1 cases are applicable for situations when applications and systems are "currently" inoperable.
RCA is a phase of the problem-solving activity in which applications and systems are "currently" operable (although possibly in a diminished state). Therefore, by definition, RCA is not a severity 1 activity.
Although RCA may not fit within the definition of a severity 1 activity, IBM fully understands the need for clients to understand the underlying cause of severe business-impacting events and to implement procedures that prevent future occurrences. When the root cause is identified during the process of resolving the reported issues, RCA will be provided. During a problem investigation, when the cause is identified as an incorrectly configured environment, a lack of adequate performance or configuration tuning and testing, or a known or new defect, that will be considered the root cause of the issues.
If the reported issues are specific to products and features outside the scope of IBM Remote Technical Support, or are specific to the clients' custom workloads and solutions, the clients may be referred to an IBM Services team or an IBM Business Partner to continue their RCA exercise through a fee-based engagement.

Question: What is IBM's SLA (Service-Level Agreement) or Response Time Requirement for providing RCA?

Answer:

Per the IBM Support Guide, there is no SLA or Response Time Requirement governing when, or whether, RCA can be provided.
Commonly reported problems range across the following types:
  1. A simple interpretation of application, process, command, or system messages.
  2. One or more events which occur during the same period and can be re-created.
  3. One or more events which occur during the same period and cannot be re-created.
  4. An extremely evasive, low-level event that occurs within millions of other events across multiple devices.
The effort, timeliness, and ability to provide RCA is dependent on a combination of factors which include:
  1. Release levels of the products and features involved with the solution.
  2. Complexity of the environment (more hardware and software features exponentially increase complexity).
  3. Frequency of the issue occurring (higher frequency increases the opportunity to capture useful data).
  4. Ability of clients to capture the correct data at the time the original event occurs, not after the event.
  5. Need to collect, analyze, and correlate data from multiple applications, systems, and devices at the time of (or leading up to) the issue.
  6. Uniqueness of the workload to the client specific environments.
  7. Type of issue (for example, crash, hang, error message, unpredictable behavior, timeout, performance).
  8. Clients following the troubleshooting procedures provided by the IBM Remote Technical Support.
  9. Whether the issue occurs with an IBM product, another vendor's product, or a custom solution.
By default, and for performance reasons, systems and applications are not configured with the maximum level of trace or debug options. As a result, and when needed, IBM Remote Technical Support will provide more in-depth troubleshooting options to be implemented while the issue reoccurs. For many issues, identifying the problem and providing RCA is an iterative (repeated) process of implementing troubleshooting options, re-creating the issue, collecting data, reviewing the data, and fine-tuning the troubleshooting options. Clients who do not follow the troubleshooting procedures, or who do not collect the requested and required data at the time the issue occurs, contribute to delays and may limit the ability of IBM Remote Technical Support to identify and resolve problems.

Question: What are the common types of RCA?

Answer:

The common types of RCA are:
  1. A factual assessment
    An assessment based on facts drawn from the data provided by clients. This assessment can only be provided when clients supply the complete set of data requested by IBM Remote Technical Support. Only when a complete set of data has been provided will IBM Remote Technical Support be capable of explaining the sequence of events leading up to, during, and after the event. A complete set of data may consist of a single data file or may require several iterations of data collection, depending on the complexity and frequency of the issue.
  2. A hypothetical assessment
    A reasonable effort assessment when no data or incomplete data is provided by clients. This assessment will be provided when clients are not able to re-create the reported issue and/or are unable to provide all the requested data. Because this is a "reasonable effort" assessment, IBM Remote Technical Support will be limited in the time and effort spent providing it.
  3. No assessment
    In some cases, IBM Remote Technical Support may not be capable of providing either a factual or a hypothetical assessment. Reasons include a lack of data and information provided by the client, a low frequency of occurrence of the issue, and/or issues related to events in products or features outside the scope of coverage for the support team's products.
In scenarios where IBM Remote Technical Support provides evidence that the issue is not related to the supported products or is related to other products, IBM Remote Technical Support may refer clients to an IBM Services team or an IBM Business Partner for a comprehensive and customized review and assessment.

Question: Is RCA provided for "End of Service" products?

Answer:
IBM Remote Technical Support is staffed and trained to provide quality technical support on IBM's most current products and release levels. It is important that clients with "In Service" versions of products, or qualifying service extensions, get the utmost priority to ensure continued operation of their business-critical applications. For this reason, when IBM clients have a business requirement to continue to use "End of Service" products or are unable to upgrade to an "In Service" level, IBM Remote Technical Support may refer them to IBM Services for continued support of "End of Service" products. IBM Services is uniquely positioned to help IBM clients through a full IT infrastructure lifecycle of strategy and planning, architecture and design, and implementation and optimization. IBM Services fees will vary based on the scope of assistance requested.
Clients who purchase product service extensions can receive limited support for "End of Service" components (for example, old versions of software or outdated hardware). For these scenarios, the terms and conditions discussed in this document will apply.
Clients who continue to use "End of Service" products without a service extension may be referred to their IBM Account Representative, an IBM Services team, or an IBM Business Partner to discuss the purchase of a service extension or to receive assistance for the out-of-service products. When possible, as a courtesy and without a guarantee, IBM Remote Technical Support may make a reasonable effort to provide an action toward the resolution of the reported issue.

Question: One or more snap or perfpmr data collections were uploaded. Why can IBM not identify the problem and/or provide RCA?

Answer:

A factual assessment can only be provided when clients have provided all the required data that captures adequate details leading up to and at the time the issue or problem event occurs.
A hypothetical assessment may be provided when no data or incomplete data is provided by clients, the data is not collected leading up to or at the time of the event, or the issue is with products beyond the scope of IBM Remote Technical Support.
Common scenarios when IBM may not be capable of providing an assessment include:
  1. The issue was a one-time event and could not be re-created.
  2. The data provided did not capture enough historical information to come to either a hypothetical or factual assessment.
  3. The data was captured at a time when the event did not occur (for example, before or after the event).
  4. The issue is related to events occurring outside the supported products.
  5. The issue is specific to the clients' custom workload or custom solution.
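As a hedged illustration of scenario 3, the following Python sketch checks whether a data collection window covers an event and the period leading up to it. The function name, the `lead_in` margin, and the timestamps are assumptions made for this example only; they are not part of snap, perfpmr, or any other IBM tool.

```python
from datetime import datetime, timedelta

def capture_covers_event(capture_start, capture_end, event_time,
                         lead_in=timedelta(minutes=5)):
    """Return True if the capture window includes the event and the lead-in before it."""
    return capture_start <= event_time - lead_in and event_time <= capture_end

# Hypothetical event time, for illustration only.
event = datetime(2024, 6, 19, 2, 30)

# Collection started after the event (for example, after a reboot).
late = capture_covers_event(
    datetime(2024, 6, 19, 3, 0), datetime(2024, 6, 19, 3, 10), event)

# Collection was already running before and during the event.
live = capture_covers_event(
    datetime(2024, 6, 19, 2, 0), datetime(2024, 6, 19, 2, 45), event)

print(late, live)
```

Only the second collection would contain details from before and during the event, which is the kind of data a factual assessment depends on.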

Question: Why does IBM keep asking for more and more data?

Answer:

Most modern computer environments consist of many connected computing, network, and storage devices, creating extremely complex solutions or infrastructures. This complexity exists whether these devices are provided by one vendor or multiple vendors. Complex issues occur when one or more microsecond events occur within a sequence of millions of events, driven by the client's unique business workload as it is processed through multiple layers of connected, custom-configured devices. For performance reasons, debug and tracing features cannot all be enabled across all the connected devices to capture any, or all, microsecond events that may trigger a specific issue. The number of possible execution paths across all connected devices servicing a unique client workload is effectively unlimited, so it is not feasible to implement comprehensive hooks in all products. Therefore, the default data collection procedures are not likely to capture one of these microsecond events (which likely occurred before the data collection) and simply display a message such as "PROBLEM OCCURRED HERE".
For these reasons, the IBM approach has been, and will continue to be:
  1. Get an accurate description of the problem or the impact to the business, and get complete and accurate details of any recent changes to the configuration, the environment, the infrastructure, and the workload. This information helps the support team narrow the focus to specific products and/or specific features in the products. Many initial delays are attributed to clients not providing accurate or complete details when the case is first opened.
  2. Review the initial data collection to determine whether the problem may be related to a known defect, to understand the configuration and whether it could be a contributing factor, and to review recent error messages. In some situations, the problem can be identified from the initial set of data. In more complex situations, the initial data provides background information and helps identify how to implement a targeted data collection. Many delays are attributed to clients not uploading the correct data, uploading more than the requested data, uploading data other than the requested data, uploading incomplete data, or uploading data collected a significant amount of time after the event occurred (for example, capturing data after having rebooted a system).
  3. For complex issues, performance issues, infrequent issues, time-out issues, and data corruption issues, clients should expect that IBM Remote Technical Support will request multiple data collections (taken when the issue does occur). This is required to further isolate and drill into the specific areas of the features and code associated with the microsecond events leading up to the issue.

Question: Why can IBM not re-create my problem?

Answer:

Most modern computer environments consist of many connected computing, network, and storage devices, creating extremely complex solutions or infrastructures. This complexity exists whether these devices are provided by one vendor or multiple vendors.
What makes IBM Power solutions popular across all industries is the products' capability to work with an infinite number of devices using an infinite number of configurations to support unique business transactions or workloads. Issues that occur are due to a combination of the overall custom infrastructure, the custom configuration of the devices in the infrastructure, and the client's unique business transactions or workload flowing across these multiple devices.
IBM, as with any vendor, simply does not have the resources and capability to re-create all situations for an infinite number of custom infrastructures, devices, configurations, and workloads. Some configurations may appear to be common practice, or the configurations may be specific to one vendor. Nonetheless, the uniqueness of the overall solution (how the devices are connected and configured) and the uniqueness of the workload are enough to make it impossible for IBM to re-create the scenario.
The likelihood that IBM can re-create a scenario increases when issues can be isolated and confirmed to be related to specific features (for example, adapters, connections, subnets, processors).

Question: A problem occurred only once and cannot be re-created. Can IBM provide RCA?

Answer:

Refer to the Question "What are the common types of RCA?" to understand the types of RCA based on the accuracy and completeness of information and data provided by clients.

Question: What can clients do to help IBM?

Answer:

Use the following step-by-step instructions to contact IBM to open a case for software with an active and valid support contract.  

1. Document (or collect screen captures of) all symptoms, errors, and messages related to your issue.

2. Capture any logs or data relevant to the situation.

3. Contact IBM to open a case:

   - For electronic support, visit the IBM Support Community Page to View and Open Cases:
     https://www.ibm.com/mysupport
   - If you require telephone support, see the web page:
     https://www.ibm.com/planetwide/

4. Provide a clear, concise description of the issue.

 - For guidance, provide these answers in your description:

  1. What is your preferred communication method?
  2. List any error messages (provide full output)
  3. Describe the unexpected behavior
  4. List all applications possibly associated with the problem
  5. List all steps to reproduce the problem if possible
  6. When did the problem begin?
  7. Were any changes applied to the product, or system, which seemed to introduce this problem?

5. If the system is accessible, collect a system snap, and upload all the details and data for your case.

 - For guidance, select the Mustgather page for your product:

  1. IBM i Mustgather Instructions
  2. AIX, PowerHA (AIX), and PowerVM Mustgather Instructions
  3. Virtual HMC (vHMC) and HMC For 7042 and 7063 Mustgather Instructions
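The case description checklist in step 4 can be sketched as a small Python structure that flags unanswered items before a case is opened. This is an illustrative sketch only: the class name, field names, and placeholder values are assumptions for this example and do not correspond to any IBM case format or API.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CaseDescription:
    # One field per checklist question in step 4 (field names are hypothetical).
    preferred_contact: str
    error_messages: List[str]
    unexpected_behavior: str
    applications: List[str]
    steps_to_reproduce: List[str]
    problem_began: str
    recent_changes: str

    def missing_items(self) -> List[str]:
        """Names of checklist items that were left empty."""
        return [name for name, value in vars(self).items() if not value]

    def render(self) -> str:
        """Format the answers as plain text for pasting into a case."""
        lines = [
            f"Preferred communication method: {self.preferred_contact}",
            "Error messages (full output):",
            *[f"  {message}" for message in self.error_messages],
            f"Unexpected behavior: {self.unexpected_behavior}",
            f"Applications involved: {', '.join(self.applications)}",
            "Steps to reproduce:",
            *[f"  {i}. {step}" for i, step in enumerate(self.steps_to_reproduce, 1)],
            f"Problem began: {self.problem_began}",
            f"Recent changes: {self.recent_changes}",
        ]
        return "\n".join(lines)

# Placeholder values; replace them with the real details of the issue.
case = CaseDescription(
    preferred_contact="email",
    error_messages=["<full error output>"],
    unexpected_behavior="<what happened versus what was expected>",
    applications=["<application name>"],
    steps_to_reproduce=[],
    problem_began="<date and time, with time zone>",
    recent_changes="<recent configuration or workload changes>",
)
print(case.missing_items())
print(case.render())
```

Listing the empty items first makes it easier to notice an incomplete description before it is submitted.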

Question: An issue occurred as a result of a known or new defect. Can we get RCA as to why it happened now, or only on specific machines?

Answer:

If a known or new defect has been identified as the underlying cause of the reported issue, technically IBM has provided the RCA: a code defect contributed to the events that occurred. Not all defects affect all systems and applications, even when similar configurations are used. Issues that occur are due to a combination of the overall custom infrastructure, the custom configuration of the devices in the infrastructure, and the client's unique business transactions or workload flowing across these multiple devices. Although configurations and workloads may appear similar, the overall execution of the millions of events that may be occurring is unique to each environment, which is why defects surface on some systems and applications and not on others.


Document Information

Modified date:
19 June 2024

UID

ibm16595181