IBM Support

Health Checks and Introduction to Troubleshooting on a Cloud Pak System

Troubleshooting


Problem

This document describes how the customer's system administrator checks the health and troubleshoots a problem on a Cloud Pak System.

Resolving The Problem

Customer System Administrators use this document to confirm the health of a Cloud Pak System.
IBM provides hardware health checking as an extra service offering. For details, contact your IBM Sales Representative and see the IBM Appliance Support Guide.
 

Using the Health Check Report 

Click "Problem determination > System Troubleshooting". Click "System Health Report".

Notice the heading at the top of the report:

This report shows diagnostic data collected by a system health check. The information should be used to identify possible problems and they should be investigated before a PMR is opened.

The health check report shows "PASSED" or "FAILED" for each of the components in the IBM Cloud Pak System and your deployments. Not all items marked as "FAILED" require a MySupport case.

Start by looking at each "FAILED" status. Work with your company's administrators, business owners, and network engineers to see whether the failed status is expected, for example:

  • System HA or Cloud Group HA shows a failed state due to a repair to a compute node or due to the movement of compute nodes within cloud groups
  • The business owners are testing a deployment
  • The network engineers are changing the network
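
The triage described above can be sketched in Python. The report is read in the console, so the (component, status) pairs and the set of expected failures below are hypothetical values that you would transcribe yourself; nothing here is a product API.

```python
def failed_needing_review(entries, expected):
    """Return components marked FAILED that nobody has confirmed as expected."""
    return [name for name, status in entries
            if status.upper() == "FAILED" and name not in expected]

# Hypothetical rows copied by hand from a health check report.
report = [
    ("System HA", "FAILED"),
    ("Storage node 1", "PASSED"),
    ("Deployment test-app", "FAILED"),
    ("Network switch 2", "FAILED"),
]

# Failures that administrators and business owners confirmed as expected.
expected_failures = {"System HA", "Deployment test-app"}

to_investigate = failed_needing_review(report, expected_failures)
print(to_investigate)
```

Keeping the expected failures in a separate set makes it easy to record who confirmed each one before you decide whether a MySupport case is needed.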

Next look at these sections:

Problems
The "Created on" column is usually the most helpful. With your team, investigate any problems. You can close old resolved problems from "Problem determination > Problems".

The service ticket column shows a case number when Service and Support Manager (Call Home) opened a case with IBM. Check MySupport for previously reported problems.

Events
If you see many events, close out all the resolved events. The report is easier to use when the events are managed. See: Managing events in IBM Cloud Pak System

The next-to-last column is the "Last Occurrence" date. The last column, "Count", lists the number of times the event occurred. Identify new items to investigate. Close all resolved items.
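
As a rough illustration of using the "Last Occurrence" and "Count" columns to find new items, the following Python sketch filters a hand-made list of events. The event texts, dates, counts, and the review date are all hypothetical; the report itself is read in the console.

```python
from datetime import date

# Hypothetical event rows: (event text, last occurrence, count).
events = [
    ("Compute node quiesced", date(2022, 6, 1), 3),
    ("Fan speed warning", date(2022, 6, 14), 12),
    ("Deployment failed", date(2022, 6, 15), 1),
]

def new_since(rows, last_review):
    """Keep events that occurred after the previous review, highest count first."""
    recent = [row for row in rows if row[1] > last_review]
    return sorted(recent, key=lambda row: row[2], reverse=True)

to_check = new_since(events, last_review=date(2022, 6, 10))
for text, last_seen, count in to_check:
    print(f"{last_seen} x{count}: {text}")
```

Events that last occurred before the previous review are candidates for closing once your team confirms that they are resolved.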

Trace settings
Check with the team managing the workloads on the IBM Cloud Pak System. This section reminds your team to turn off trace settings when debugging is complete to improve performance.

Continue looking at the other information in the report.

Search for more information:
Documentation for events, troubleshooting, and topics on the product's features:

For resolution to known problems:
IBM's webpages at www.ibm.com

Checking Health from Cloud Pak System User Interface

The following entries list each check, the web console page where you perform it, and a description of what to look for.

Check: High Availability Status of the system
Web console page: Problem determination > System Troubleshooting, then expand the Management High Availability section.
Description: Under Service Code, the overall status is "Online" when this feature is working properly.

Check: Temperature
Web console page: Hardware > Infrastructure Map. Click "Temperature" in the top menu bar.
Description: Check the temperature values for the system. Click the image of the component for details.

Check: Component Status
Web console page: Hardware > Infrastructure Map
Description: For the components in the system, check the status shown on the web page.
When there is a red "!" in the picture of the system, click the icon. The "Default section" is presented with more information. When there are numbers after the red "!", click the number and review the events posted. Check the "Updated on" column to see when the event was posted.

Web console page: Hardware > Infrastructure Map. Click "Switch to Tree View", then Flex Chassis > Chassis Management Module.
Description: Look for any alerts and review the information shown on the web page.
When there is a Flex chassis, click each chassis in the system.
Click any number of 1 or greater after Errors or Warning Events. Review the health statistics and other information on these pages.

Web console page: Hardware > Compute Nodes
Description: Click any number of 1 or greater after Errors or Warning Events. Review the health statistics and other information on these pages.
Check that none of the nodes are in a quiesced or stopped state. The compute nodes are powered on and available unless there is a problem or system administrators set the state otherwise.

Web console page: Hardware > Management Nodes
Description: Check the management nodes to make sure they are powered on and available.
Click any number of 1 or greater after Errors or Warning Events.
One of the PSMs is marked as the leader with "Platform System Manager -Primary" in the "Type" field.

Web console page: Hardware > Storage Devices
Description: Look at all the Storage Nodes and Storage Node Expansions to make sure the Disk Drives, LUNs, and Storage controller ports are all available for each node.

Check: Problems
Web console page: Problem Determination > Problems
Description: Problems identified by the system are listed.
You can sort problems by any of the column headers, but the "Created on" date is usually the most helpful. Investigate any problems.
Problems are not automatically deleted from this view. It is important to close them on resolution.
Problems are often (but not always) associated with Events. There can be some overlap between information in this view and that presented in the Problem Determination > Events view.

Check: Events
Web console page: Problem Determination > Events
Description: Events identified by the system are listed.
You can sort Events by any of the column headers, but "Updated on" and "Severity" are usually the most helpful. Some of the events are from problems that are experienced by the workload users and are critical to that workload but not to the entire system.

Check: Job Queue
Web console page: System > Job Queue
Description: Make sure that jobs are starting and finishing on this page.
Check the Started Queue for failures.
Click the box to show internal jobs to get a complete picture of the job queue. Sorting by Status is the quickest approach because it pulls all failures to the top.
Review failures of individual deployment jobs with the team managing the system workload.
Scan the information to look for pervasive system issues.

Check: Failed Deployments
Web console page: Patterns > {Virtual System Instances, Virtual System Instances (Classic), Virtual Application Instances}
Description: Look over the list of all instances.
Many deployments in a failed or incomplete initializing status can indicate system-wide problems.
Confirm with the team managing the workloads that the deployment projects are in the expected state.

Check: Validate Shared Services
Web console page: Patterns > Shared Services
Description: Validate that shared services are running for all appropriate cloud groups and appear to be in good health. Work with the local patterns administrator to confirm the list of shared services.
NOTE: It is not unusual for shared services to be stopped or not deployed for various reasons depending upon the use of the system. Check with the patterns administrator on the expected status of shared services.

Troubleshooting

Topics on the product's features and troubleshooting:

  1. Check "Problem determination > Problems". Review the list.
    • Problems identified by the system are listed. You can sort problems by any of the column headers. The "Created on" column is usually the most helpful. Investigate any problems that are reported. When the issue is resolved or determined to not be a problem, close the item.
    • Problems are not automatically deleted from this view. It is important to close entries on resolution. Problems are often (but not always) associated with Events. There can be some overlap between the information presented in the two views.
    • Search the support portal for an explanation of the message in the product documentation center, a technote, or APAR document.
  2. Check the event log: "Problem determination > Events".
    • If the event ends with "I", the event is informational. If it ends with "W", the event is a warning for the system administrators. Events that end with "E" need to be checked by the system administrators and the business owners for the workloads, because some of them are expected, for example:

      • IBM Cloud Pak System posts "E" messages if the system administrators move compute nodes from the System HA cloud group. These messages are expected.

      • IBM Cloud Pak System posts messages or calls home if a battery was removed during a service call for a storage node. This message is expected.

    • Check with your system admin team and business owners for the workloads.  The actions they take result in events, especially "I" and "W" messages.

    • Some of the Events are from problems experienced by the workload users and are critical to that workload but not to the entire system. Check with the workload users to see whether they are testing or debugging.

    • Export the event to a file.  On the line for the events, there is a set of icons. Click the "circle and arrow" icon to export the errors to a file.  Study the file to determine when the event first started and if the event stopped occurring. Check "Problem determination > Events". Look for a follow-on event noting the event is resolved.
    • Look at the "Event Type". If the event type shows one of the hardware components in the IBM Cloud Pak System:
      • Navigate to "Hardware > Infrastructure Map".
      • On the upper left corner of the page, click "Switch to Tree View".
      • Click the name of the hardware component. Follow the instructions in the "Checking Health from Cloud Pak System User Interface" section in this document.
    • Search the support portal for an explanation of the event or message in the product documentation, a technote, or APAR document.
    • Search www.ibm.com for information. 
  3. Check the job queue: "System > Job Queue"
    • Under "Started Jobs Queue", check that jobs are running.  If not, check the "Display Internal Jobs" box. Refresh the screen. Click "double arrows" under the "Started Jobs Queue" line. Confirm that there were internal jobs running, and successfully completing. Wait 10 minutes. If there are no internal jobs running or completing, save a capture of this screen.
    • Look at the "Status" column for failed jobs. Wait and check this page again to make sure that jobs are moving in the Started Queue and that there are not many failures. Failures of individual deployment jobs are usually not a concern unless all deployment jobs seem to be failing.
    • A "Pending" job is queued up to run at some point in the future.  The "Internal Backup Job" is queued and pending to run every day.
    • Check the storage nodes: "Hardware > Storage Devices". Look at all the Storage Nodes and Storage Node Expansions to make sure the Disk Drives, LUNs, and Storage controller ports are all available for each node.
  4. Check the Management Nodes:  "Hardware > Management Nodes". Look at the status of elements and events associated with those elements. 
  5. Check the Management High Availability Status of the PSMs
    • Click "Problem determination > System Troubleshooting", expand the "Management High Availability" section.
    • In the table, there are service code and status columns. The status column shows a status of "Online" when this feature is working properly.
  6. Check Chassis: "Hardware > Flex Chassis". Open the selection. There are 2 values: "Ambient Temperature" and "Maximum Ambient Temperature". Hover over the temperature icons for these fields to make sure the temperature is within range. If there is a red "!" in the picture of the system, click the icon. The "Default section" is presented with more information.
  7. Check the Network Switches:  "Hardware > Network Devices". Click each switch.
  8. Check the DNS connectivity: "System > System Settings". Expand "Domain Service (DNS)". Use "Lookup host name or IP address". Confirm you can connect to the expected IP addresses by IP and hostname. Consult with your team managing the DNS server for advice.
  9. Check for failed deployments: "Patterns > {Virtual System Instances, Virtual System Instances (Classic), Virtual Application Instances, Shared Services Instances}".
    • A few deployments in a failed or error state can be normal, depending on the way that these systems are used. Confirm all is well with the team working on these projects.
    • Many deployments in failed or stalled states can indicate system-wide problems.
  10. Consult the troubleshooting chapters in the IBM Cloud Pak System product documentation for next steps: 
  11. Otherwise, if you see other problems that you would like IBM to investigate, first use the technote on organizing your problem information, then this technote: Contacting IBM Cloud Pak System and Cloud Pak System Software Technical Support
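
The DNS check in step 8 compares lookups by host name and by IP address. As a complementary sketch, the same comparison can be made from an administrator's workstation with Python's standard `socket` module. This runs outside the Cloud Pak System, so it does not replace the console lookup, and the function name `check_dns` and its parameters are illustrative assumptions.

```python
import socket

def check_dns(hostname, expected_ip,
              forward=socket.gethostbyname,
              reverse=socket.gethostbyaddr):
    """Compare forward and reverse lookups against the expected values.

    Returns a dict of booleans; a failed lookup counts as a mismatch.
    The resolver functions are parameters so that they can be replaced
    when testing without a network.
    """
    try:
        forward_ok = forward(hostname) == expected_ip
    except OSError:  # socket.gaierror and socket.herror derive from OSError
        forward_ok = False
    try:
        # gethostbyaddr returns (primary name, alias list, address list).
        name, aliases, _ = reverse(expected_ip)
        reverse_ok = hostname in [name, *aliases]
    except OSError:
        reverse_ok = False
    return {"forward": forward_ok, "reverse": reverse_ok}
```

Call it with a host name and address that your DNS team expects to match, for example `check_dns("host.example.com", "192.0.2.10")` (both values hypothetical). A `False` entry is a prompt to consult the team managing the DNS server, not a diagnosis.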

Health checking with the IBM SSR during a service call

Together, the IBM SSR, your system administrators, and your data center team confirm that the problem is resolved before the IBM SSR leaves your data center.

The IBM SSR checks the physical status of the component. System Administrators check the IBM Cloud Pak System user interface to confirm the problem is resolved.

Navigate to "Hardware" > "Component" for the component that was serviced. Look for any number after Error or Warning, and click the number to see whether there are any new problems or events. Scroll through the page to make sure that all is as expected: the Status is "Available", the LEDs are off (except for the power LED), the fans show no problems, and the power supplies are working.

Open the System Health check report: click "Problem determination > System Troubleshooting", then click "System Health Check".

Look for the component that was repaired to make sure it is not in a failed state.

See the "Using the Health Check Report" section in this document to confirm that system or cloud group high availability and other components are online.


Document Information

Modified date:
16 June 2022

UID

swg21675742